CN101599137A - Autonomous operant conditioning reflex automat and the application in realizing intelligent behavior - Google Patents
- Publication number
- CN101599137A (application CNA2009100892633A / CN200910089263A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
An autonomous operant conditioning automaton and its application in realizing intelligent behavior belong to the field of bionics. The autonomous operant conditioning automaton AOC is a discrete computational model of an autonomous automaton, mainly comprising an operation set, a state set, a "condition-operation" rule set, observable state transitions, and an operant conditioning learning law; in addition, a behavior (operating) entropy based on the AOC state orientation values is defined, and a recursive running procedure of the AOC is specified. The key feature of the AOC is that it simulates the operant conditioning mechanism of organisms and therefore has bionic self-organizing capabilities, including self-learning and self-adaptation. It can be used to describe, simulate and design various self-organizing systems, and in particular to describe, simulate and plan various intelligent behaviors of robot systems.
Description
Technical Field
The invention relates to an automaton, and in particular to a bionic automaton based on the principle of operant conditioning.
Background
Automaton models of learning systems were developed in the 1960s and are referred to as learning automata. In recent years the structure of learning automata has often had to be changed to meet different application requirements; such automata typically have both inputs and outputs. The invention is a self-organizing system based on Skinner's theory of operant conditioning and has self-learning and self-adaptation capabilities. Since the beginning of the 20th century, the study of animal learning has identified two forms of learning: one is classical conditioning, which shapes an organism's respondent behavior; the other is operant conditioning, which shapes an organism's operant behavior. These two types of reflexes are held to be two distinct coupling processes: the classical conditioned reflex is an S-R coupling process, while the operant conditioned reflex is an R-S coupling process.
In recent decades, academic interest in autonomous systems has grown year by year, and the body of literature on autonomous systems has likewise grown. The invention is an autonomous automaton: unlike a non-autonomous automaton, its output does not need to be driven by external instructions but is produced by the automaton according to its own needs. Related patents include: a picture generating method using a customer-operated automaton and the customer-operated automaton itself (application No. 98115560.X), and a regular-expression matching acceleration method based on a deterministic finite automaton with memory (application No. 200710071071.0); these all realize a given function through interaction between the automaton and the external environment. To date there has been no autonomous operant conditioning automaton.
The invention provides an abstract self-organizing model based on Skinner's operant conditioning theory, which is used to describe, simulate and design various self-organizing systems so that they exhibit self-learning and self-adaptive characteristics.
Disclosure of Invention
The invention provides an autonomous operant conditioning automaton which can be used to describe, simulate and design self-organizing systems (including self-learning and self-adaptation).
The operant conditioning automaton of the invention is a nine-tuple comprising: a discrete time, an operation symbol set, a state set, a random "condition-operation" rule set, a state transition function, an orientation function, an operant conditioning learning law, an operating entropy, and an initial state; a recursive running procedure of the AOC is also specified. The AOC is characterized in that it simulates the operant conditioning mechanism of organisms, thereby having bionic self-organizing capabilities, including self-learning and self-adaptation, and can be used to describe, simulate and design various self-organizing systems with interaction capabilities.
A general finite state automaton is a five-tuple FA = {A, Z, S, f, g}, where A is a finite input symbol set, S is a finite (internal) state symbol set (s(0) ∈ S being the initial state), Z is a finite output (acceptance-state) symbol set, f: S × A → S is the state transition function, and g: S → Z is the output function. The finite state automaton FA is a non-autonomous system.
The AOC and the finite state automaton FA are not equivalent: the operation symbols of the AOC are not equivalent to the input symbols of the FA, since the former represent the AOC's internal operations while the latter represent external instructions. The operation symbol set Ω of the AOC is therefore not the FA's input symbol set but the set of the AOC's internal operations, whereas the FA's input symbol set is in effect the set of instructions that can be supplied from outside. The AOC has no output symbol set and hence no output function. Does an AOC, as an autonomous system, need an output symbol set and an output function? An autonomous system may also need to act on the environment or on the objective world. Viewed from the form of the state-space equations, the output is a combination of states, or of states and operations; therefore the internal state set of the AOC can itself be regarded as the output symbol set, and the states of the AOC are observable. "The states of the AOC are observable" means that the AOC itself has receptors and can detect changes in its own state; it does not mean that the outside world can observe these quantities. The autonomous automaton also produces output, but this output does not need to be driven by external instructions; it is produced according to the automaton's own needs.
Compared with a non-autonomous automaton, the autonomous automaton has the advantage that its output does not need to be driven by external instructions; it can act on the environment according to its own needs. That is, even if the external environment changes, the autonomous automaton can still work normally, whereas a non-autonomous automaton must change its structural model or parameters to adapt to changes in the external environment. A non-autonomous system can always be converted into an autonomous system, so an autonomous operant conditioning automaton AOC can always be found that corresponds to a given non-autonomous operant conditioning automaton. The autonomous operant conditioning automaton AOC is therefore more widely applicable.
In information theory, entropy can be used as a measure of the uncertainty of an event. The more information a system has absorbed, the more regular its structure, the more complete its function, and the smaller its entropy. Using the concept of entropy, the measurement, transmission, transformation and storage of information can be studied theoretically. The invention introduces the concept of operating entropy and proves the convergence of the AOC operating entropy ψ(t); the self-organizing process of the system is a process of absorbing information, absorbing negative entropy and eliminating uncertainty, which clarifies the self-organizing property of the AOC, namely that the AOC has self-learning and self-adaptation capabilities.
The invention provides an autonomous operant conditioning automaton; Skinner's animal experiments are simulated to show that the automaton realizes the mechanism of operant conditioning learning, and balance control of a two-wheeled self-balancing robot is also realized, which shows that the AOC can be used to design various intelligent behaviors of robot systems.
The automaton of the invention is a nine-tuple autonomous operation conditional reflection automaton:
AOC=<t,Ω,S,Γ,δ,ε,η,ψ,s0>
wherein
(1) Discrete time of the AOC: t ∈ {0, 1, 2, …, n_t}; t = 0 is the starting time of the AOC;
(2) Operation symbol set of the AOC: Ω = {α_k | k = 1, 2, …, n_Ω}; α_k is the kth operation symbol of the AOC;
(3) State set of the AOC: S = {s_i | i = 0, 1, 2, …, n_S}; s_i is the ith state of the AOC;
(4) Operation rule set of the AOC: Γ = {r_ik(p) | p ∈ P; i ∈ {0, 1, 2, …, n_S}; k ∈ {1, 2, …, n_Ω}}; the random "condition-operation" rule r_ik(p): s_i → α_k(p) means that the AOC, under the condition that it is in state s_i ∈ S, performs operation α_k ∈ Ω with probability p ∈ P, p = p_ik = p(α_k | s_i), i.e. the probability that the AOC performs operation α_k given that it is in state s_i; P denotes the set of the p_ik;
(5) State transition function of the AOC: δ: S(t) × Ω(t) → S(t+1); the state s(t+1) ∈ S of the AOC at time t+1 is determined by the state s(t) ∈ S at time t and the operation α(t) ∈ Ω at time t, and is independent of the states and operations before time t; the state transition process determined by δ may be known or unknown, but the result of the state transition is observable;
(6) Orientation function of the AOC: ε: S → E = {ε_i | i = 0, 1, 2, …, n_S}; ε_i = ε(s_i) ∈ E is the orientation value of state s_i ∈ S;
(7) Operant conditioning learning law of the AOC: η adjusts the implementation probability p ∈ P of the operation rule r_ik(p) ∈ Γ. Suppose the state at time t is s(t), an operation α(t) ∈ Ω is performed, and the state s(t+1) is observed at time t+1. According to Skinner's operant conditioning theory, if ε(s(t+1)) − ε(s(t)) < 0 then p(α(t) | s(t)) tends to decrease, whereas if ε(s(t+1)) − ε(s(t)) > 0 then p(α(t) | s(t)) tends to increase. At time t the AOC is in state s(t) = s_i and selects operation α(t) = α_k; according to the state transition function, the state at the next moment is s(t+1) = s_j. Simulating the operant conditioning mechanism of living beings, the probability of the current operation at the next moment t+1 is changed: it is increased by Δ, where Δ is related to the orientation increment (the larger the orientation increment, the better the result of the operation and the larger Δ). The probabilities of the remaining operations at time t+1 are each decreased by a share of Δ, the share of each being its proportion of the probability mass of the operations other than the one selected at time t, so that the decreases sum exactly to Δ. This guarantees that the probabilities of the operations sum to 1 at every moment. Formally: when s(t) = s_i, α(t) = α_k and s(t+1) = s_j, then p_ik(t+1) = p_ik(t) + Δ, and the probability of every other operation is updated as p_iu(t+1) = p_iu(t) − Δξ, where u denotes any value from 1 to n_Ω other than k. Here p_ik(t) is the probability at time t that the AOC performs operation α_k ∈ Ω under the condition that its state is s_i ∈ S, p_ik(t+1) is the same probability at time t+1, and 0 ≤ p_ik + Δ ≤ 1;
ε⃗_ij = ε(s_j) − ε(s_i)
is the increment of the orientation value; Δ is obtained from ε⃗_ij through a monotonically increasing function that equals 0 if and only if its argument is 0; a is the learning rate;
ξ = p_iu(t) / Σ_{v≠k} p_iv(t),
where v ranges over all values from 1 to n_Ω other than k; Σ_{v≠k} p_iv(t) is the sum at time t of the probabilities of performing, under the condition that the AOC state is s_i ∈ S, the operations other than α_k; p_iu(t) is the probability at time t that the AOC performs operation α_u ∈ Ω given state s_i ∈ S, and p_iu(t+1) is the same probability at time t+1.
(8) Operating entropy of the AOC: ψ: P × E → R⁺, where R⁺ is the set of positive real numbers. The operating entropy ψ_i(t) of the AOC under the condition that its state at time t is s_i is determined by the operation probability set and the orientation function set under the condition s(t) = s_i:
ψ_i(t) = −Σ_{k=1}^{n_Ω} p_ik(t) log₂ p_ik(t).
Knowing the operating entropy of each state and weighting and summing them gives the operating entropy of the AOC at time t:
ψ(t) = Σ_{i=0}^{n_S} p(s_i) ψ_i(t) = −Σ_{i=0}^{n_S} p(s_i) Σ_{k=1}^{n_Ω} p(α_k|s_i) log₂ p(α_k|s_i).
(9) Initial state of AOC: s0=s(0)∈S。
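To make the learning law (7) and the operating entropy (8) concrete, a minimal Python sketch of one update step follows. It is an illustration only: the function names are invented here, and the concrete choice Δ = a × ε⃗_ij × (1 − p_ik) is taken from the embodiments described later rather than from the general definition.

```python
import math

def oc_update(p_i, k, eps_inc, a):
    """One operant-conditioning update of the operation probabilities p_i for the
    current state s_i, after performing operation alpha_k and observing the
    orientation increment eps_inc = eps(s_j) - eps(s_i).
    Assumes Delta = a * eps_inc * (1 - p_ik), the form used in the embodiments."""
    delta = a * eps_inc * (1.0 - p_i[k])
    delta = max(min(delta, 1.0 - p_i[k]), -p_i[k])   # keep 0 <= p_ik + Delta <= 1
    rest = sum(p for u, p in enumerate(p_i) if u != k)
    new_p = list(p_i)
    new_p[k] = p_i[k] + delta                        # reward raises p_ik, punishment lowers it
    for u in range(len(p_i)):
        if u != k and rest > 0:
            xi = p_i[u] / rest                       # xi = p_iu(t) / sum_{v != k} p_iv(t)
            new_p[u] = p_i[u] - delta * xi           # the decreases sum exactly to Delta
    return new_p

def operating_entropy(state_probs, op_probs):
    """psi(t) = -sum_i p(s_i) sum_k p(alpha_k|s_i) log2 p(alpha_k|s_i)."""
    return -sum(p_s * sum(p * math.log2(p) for p in row if p > 0)
                for p_s, row in zip(state_probs, op_probs))
```

For example, with p_i = [0.5, 0.5], a reward of ε⃗ = 1 for the first operation and a = 0.01, `oc_update` returns [0.505, 0.495], in line with the gradual drift toward the rewarded operation described above.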
The invention is characterized in that it simulates the operant conditioning mechanism of organisms and therefore has a bionic self-organizing capability, including self-learning and self-adaptation, and can be used to describe, simulate and design various self-organizing systems.
The autonomous operant conditioning automaton AOC of the invention runs recursively according to the following steps:
(1) Initialization: set t = 0, randomly choose the initial state s(0) of the AOC, set the learning rate a, and set the initial operation probabilities p_ik(0) = 1/n_Ω (i = 0, 1, 2, …, n_S; k = 1, 2, …, n_Ω); set the stopping time T_f;
(2) Operation selection: according to the rules r_ik(p): s_i → α_k(p) of the "condition-operation" rule set Γ, i.e. the AOC in state s_i ∈ S performs operation α_k ∈ Ω with probability p ∈ P, p = p_ik = p(α_k | s_i), randomly select an operation α(t) ∈ Ω given the current AOC state s(t) ∈ S;
(3) Operation execution: at time t the AOC is in state s(t) ∈ S and performs the operation α(t) ∈ Ω selected in the previous step; the current state makes a transition according to δ(s(t), α(t)) = δ(s_i, α_k);
(4) State observation: according to the state transition function of the AOC, δ: S(t) × Ω(t) → S(t+1), the result of the state transition is fully observable, i.e. there exists j ∈ {0, 1, 2, …, n_S} such that s(t+1) = s_j;
(5) Operant conditioning: performing the operation at time t not only shifts the state of the AOC but also changes the probability of performing each operation at the next moment; according to the operant conditioning learning law η, the implementation probability p ∈ P of the operation rule r_ik(p) ∈ Γ is adjusted: if at time t s(t) = s_i and α(t) = α_k, the operation probabilities at time t+1 are updated as p_ik(t+1) = p_ik(t) + Δ and p_iu(t+1) = p_iu(t) − Δξ (u ≠ k), where 0 ≤ p_ik + Δ ≤ 1;
ε⃗_ij = ε(s_j) − ε(s_i)
is the increment of the orientation value, and a is the learning rate;
ξ = p_iu(t) / Σ_{v≠k} p_iv(t);
(6) Operating entropy calculation: compute the operating entropy ψ(t) at time t according to the defined operating-entropy formula;
(7) Recursion: if t + 1 ≤ T_f, set t = t + 1 and repeat steps (2)-(7);
(8) When t + 1 > T_f, stop.
The flow chart of the method of the invention is shown in FIG. 2.
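A minimal Python sketch of steps (1)-(8) above is given below, under the assumption that the state transition function δ and the orientation values ε are supplied as dictionaries; it reuses the illustrative `oc_update` and `operating_entropy` helpers sketched earlier. The text does not say how the state probabilities p(s_i) entering ψ(t) are obtained, so empirical visit frequencies are used here purely as an assumption.

```python
import random

def run_aoc(states, ops, delta, eps, a=0.01, T_f=1000, s0=None):
    """Recursive AOC run: select an operation by its current probability, observe the
    transition, apply the operant-conditioning update, and record psi(t)."""
    p = {s: [1.0 / len(ops)] * len(ops) for s in states}       # (1) p_ik(0) = 1/n_Omega
    s = s0 if s0 is not None else random.choice(states)
    visits = {st: (1 if st == s else 0) for st in states}
    entropies = []
    for t in range(T_f):
        k = random.choices(range(len(ops)), weights=p[s])[0]   # (2) select an operation
        s_next = delta[(s, ops[k])]                            # (3)-(4) act and observe
        eps_inc = eps[s_next] - eps[s]                         # orientation increment
        p[s] = oc_update(p[s], k, eps_inc, a)                  # (5) operant conditioning
        visits[s_next] += 1
        total = sum(visits.values())
        state_probs = [visits[st] / total for st in states]    # assumed empirical p(s_i)
        entropies.append(operating_entropy(state_probs,        # (6) psi(t)
                                           [p[st] for st in states]))
        s = s_next                                             # (7) recursion
    return p, entropies                                        # (8) stop at T_f
```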
Drawings
FIG. 1 is a schematic structural diagram of an autonomous operation conditional reflection automaton according to the present invention;
t is the discrete time (1), Ω is the operation set of the α_k (k = 1, 2, …, n_Ω) (2), S is the state set of the s_i (i = 0, 1, 2, …, n_S) (3), Γ is the "condition-operation" rule set of the r_ik (i ∈ {0, 1, …, n_S}; k ∈ {1, 2, …, n_Ω}) (4), δ is the state transition function (5), ε is the orientation function (6), η is the operant conditioning learning law (7), ψ is the operating entropy (8), and s_0 is the initial state (9).
FIG. 2 is a flow chart of an AOC program of the autonomous operation conditional reflecting automaton;
FIG. 3 shows the operation behavior probability curves of the rat experiment;
FIG. 4 shows the operating entropy curve of the rat experiment;
FIG. 5 is an operation behavior probability curve of a machine pigeon;
FIG. 6 is an operation entropy curve of a machine pigeon experiment;
fig. 7 is a graph showing probability of each operation behavior of the two-wheeled self-balancing robot in an upright state, that is, when the deflection angle θ is 0 °;
FIG. 8 shows probability curves of operation behaviors of the two-wheeled self-balancing robot when the deflection angle is more than 0 degrees and less than 12 degrees;
fig. 9 is a graph showing probability of each operation behavior of the two-wheeled self-balancing robot when the deflection angle θ is 12 °;
FIG. 10 shows the probability curves of the operation behaviors of the two-wheeled self-balancing robot when the deflection angle satisfies −12° < θ < 0°;
fig. 11 is a graph showing probability of each operation behavior of the two-wheeled self-balancing robot when the deflection angle θ is-12 °;
FIG. 12 is an operation entropy curve of a two-round self-balancing robot experiment;
Examples
Embodiment 1: a minimal system, a learning rat, which simulates Skinner's rat experiment. Skinner's rat experiment, briefly: a white rat is placed in a Skinner box fitted with a lever; the box is constructed so as to exclude external stimuli as far as possible. The rat can move freely in the box; when it presses the lever, a portion of food drops into a tray below, and the rat can eat it. A device outside the box records the animal's actions. The rat learns to press the lever repeatedly and obtains the food reward through its own actions. Skinner's rat experiment is here realized by the autonomous operant conditioning automaton. The rat has two operation behaviors, pressing the lever, α_1, and not pressing the lever, α_2, i.e. the operation set is Ω = {α_1, α_2}, with probabilities denoted p_1 and p_2 respectively. Its state set is S = {s_0, s_1}, where s_0 denotes the hungry state and s_1 the non-hungry state. The operation rule set is Γ = {r_ik(p) | p ∈ P; i ∈ {0, 1}; k ∈ {1, 2}}, where the random "condition-operation" rule r_ik(p): s_i → α_k(p) means that the AOC, in state s_i ∈ S, performs operation α_k ∈ Ω with probability p ∈ P, p = p_ik = p(α_k | s_i), i.e. the probability of performing operation α_k given state s_i. Its state transition function δ: S(t) × Ω(t) → S(t+1) is, specifically:
δ(s_0, α_1) = s_1, δ(s_0, α_2) = s_0, δ(s_1, α_1) = s_1, δ(s_1, α_2) = s_0. Its orientation function: ε: S → E = {ε_i | i = 0, 1}, ε_i = ε(s_i) ∈ E is the orientation value of state s_i ∈ S, and we define
Δ = a × ε⃗_ij × (1 − p_1).
where a is the learning rate and ε⃗_ij is the increment of the orientation value. The probabilities of the two behaviors at the initial moment are both 0.5. The rat obtains a reward whenever it presses the lever, so the probability of pressing the lever at the next moment increases, i.e. the probability that the rat selects lever pressing grows; the probability is updated according to the operant conditioning learning law η. With repeated learning, the probability p_1 that the rat selects lever pressing becomes larger and larger. The learning rate of the experiment is a = 0.01; after 668 learning steps the rat has learned to press the lever to obtain food, and as FIG. 3 shows, the probability p_1 that the rat presses the lever eventually tends to 1. During the experiment, according to the defined operating-entropy formula
ψ(t) = −Σ_{i=0}^{n_S} p_i Σ_{k=1}^{n_Ω} p_ik log₂ p_ik = −Σ_{i=0}^{n_S} p(s_i) Σ_{k=1}^{n_Ω} p(α_k|s_i) log₂ p(α_k|s_i),
the operating entropy at each moment is calculated. The operating entropy ψ(t) of the AOC becomes smaller and smaller over time and tends to its minimum as t → ∞ (see FIG. 4), which shows that the operating entropy ψ(t) of the AOC converges. The AOC is a self-organizing system based on Skinner's operant conditioning theory and has self-learning and self-adaptation capabilities. The self-organizing process of the system is a process of absorbing information, absorbing negative entropy and eliminating uncertainty. Since the convergence of the AOC operating entropy ψ(t) has been demonstrated, the self-organizing property of the AOC is thereby clarified.
The specific implementation steps of the experiment are as follows:
(1) Initialization: set t = 0, randomly choose the initial state s(0) of the AOC, set the learning rate a = 0.01, and set the initial operation probabilities p_ik(0) = 0.5 (i = 0, 1; k = 1, 2); set the stopping time T_f = 1000;
(2) Operation selection: according to the "condition-operation" rule set Γ = {r_ik(p) | p ∈ P; i ∈ {0, 1}; k ∈ {1, 2}}, where the random rule r_ik(p): s_i → α_k(p) means that the AOC in state s_i ∈ S performs operation α_k ∈ Ω with probability p = p_ik = p(α_k | s_i), randomly select an operation α(t) ∈ Ω given the current AOC state s(t) ∈ S;
(3) Operation execution: at time t the AOC is in state s(t) ∈ S and performs the operation α(t) ∈ Ω selected in the previous step; the current state makes a transition according to δ: S(t) × Ω(t) → S(t+1), specifically δ(s_0, α_1) = s_1, δ(s_0, α_2) = s_0, δ(s_1, α_1) = s_1, δ(s_1, α_2) = s_0;
(4) State observation: according to the state transition function of the AOC, δ: S(t) × Ω(t) → S(t+1); the transition process may be known or unknown, but its result is fully observable, i.e. there exists j ∈ {0, 1} such that s(t+1) = s_j;
(5) Operant conditioning: performing the operation at time t not only shifts the state of the AOC but also changes the probability of performing each operation at the next moment; according to the operant conditioning learning law η, the implementation probability p ∈ P of the operation rule r_ik(p) ∈ Γ is adjusted: if at time t s(t) = s_i and α(t) = α_k, the operation probabilities at time t+1 are updated as p_ik(t+1) = p_ik(t) + Δ and p_iu(t+1) = p_iu(t) − Δξ (u ≠ k), where 0 ≤ p_ik + Δ ≤ 1;
ε⃗_ij = ε(s_j) − ε(s_i)
is the increment of the orientation value, and a is the learning rate;
ξ = p_iu(t) / Σ_{v≠k} p_iv(t);
(6) Operating entropy calculation: compute the operating entropy ψ(t) at time t according to the defined operating-entropy formula;
(7) Recursion: if t + 1 ≤ T_f, set t = t + 1 and repeat steps (2)-(7);
(8) When t + 1 > T_f, stop.
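As an illustration only, the rat embodiment above can be plugged into the hypothetical `run_aoc` sketch from earlier as follows; the numerical orientation values ε(s_0) and ε(s_1) are not given in the text, so the values below (hungry oriented lower than fed) are assumptions.

```python
# Rat embodiment: 2 states, 2 operations (press the lever alpha_1, do not press alpha_2).
states = ["s0_hungry", "s1_fed"]
ops = ["press", "no_press"]
delta = {("s0_hungry", "press"): "s1_fed",  ("s0_hungry", "no_press"): "s0_hungry",
         ("s1_fed",    "press"): "s1_fed",  ("s1_fed",    "no_press"): "s0_hungry"}
eps = {"s0_hungry": 0.0, "s1_fed": 1.0}      # assumed orientation values, not given in the text

p, psi = run_aoc(states, ops, delta, eps, a=0.01, T_f=1000, s0="s0_hungry")
print(p["s0_hungry"][0])                     # probability of "press" should approach 1, cf. FIG. 3
print(psi[-1] < psi[0])                      # the operating entropy decreases, cf. FIG. 4
```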
Embodiment 2: a machine pigeon with learning ability, which simulates Skinner's pigeon experiment. In that experiment the pigeon is fed when it pecks the red button (positive reinforcement), receives no stimulus when it pecks the yellow button, and receives an electric shock when it pecks the blue button (negative reinforcement); at the beginning the pigeon pecks the red, yellow and blue buttons at random. After a while the pigeon pecks the red button significantly more often than the other two buttons. For the machine pigeon we define a 3-operation, 3-state autonomous operant conditioning automaton with operation set Ω = {α_0, α_1, α_2}, whose elements are pecking the red button α_0, pecking the yellow button α_1 and pecking the blue button α_2, with probabilities denoted p_0, p_1 and p_2 respectively. The state set is S = {s_0, s_1, s_2}, i.e. the non-hungry state s_0, the half-hungry state s_1 and the hungry state s_2. The state transition rules are:
δ(s0×α0)=s0 δ(s0×α1)=s1 δ(s0×α2)=s1
δ(s1×α0)=s0 δ(s1×α1)=s2 δ(s1×α2)=s2
δ(s2×α0)=s1 δ(s2×α1)=s2 δ(s2×α2)=s2
shown in tabular form in Table 1 below. Its orientation function: ε: S → E = {ε_i | i = 0, ±0.5, ±1}, ε_i = ε(s_i) ∈ E is the orientation value of state s_i ∈ S, and we define
Δ = a × ε⃗_ij × (1 − p_0).
The orientations are: s_0 → s_0: zero orientation (ε⃗_00 = 0); s_0 → s_1: zero orientation (ε⃗_01 = 0); s_1 → s_0: positive orientation (ε⃗_10 = 0.5); s_1 → s_2: negative orientation (ε⃗_12 = −0.5); s_2 → s_1: positive orientation (ε⃗_21 = 1.0); s_2 → s_2: negative orientation (ε⃗_22 = −1.0).
According to the operant conditioning learning law η: if the current operation is rewarded (ε⃗_ij > 0), its implementation probability tends to increase and the implementation probabilities of the other operations decrease accordingly; if the current operation is neither rewarded nor punished (ε⃗_ij = 0), the probabilities of all operations remain unchanged; if the current operation is punished (ε⃗_ij < 0), its implementation probability tends to decrease and the implementation probabilities of the other operations increase accordingly. The initial probability of each operation is 1/3; after about 5000 learning steps the machine pigeon essentially pecks only the red button and no longer the yellow or blue buttons. As FIG. 5 shows, the probability p_0 that the machine pigeon pecks the red button tends to 1, while the probability p_1 of pecking the yellow button and the probability p_2 of pecking the blue button tend to 0.
TABLE 1 State transition of machine pigeons
During the experiment, the operating entropy at each moment is calculated according to the defined operating-entropy formula (see FIG. 6).
The specific implementation steps of the experiment are as follows:
(1) Initialization: set t = 0, randomly choose the initial state s(0) of the AOC, set the learning rate a = 0.01, and set the initial operation probabilities p_ik(0) = 1/3 (i = 0, 1, 2; k = 0, 1, 2); set the stopping time T_f = 5000;
(2) Operation selection: according to the "condition-operation" rule set Γ = {r_ik(p) | p ∈ P; i ∈ {0, 1, 2}; k ∈ {0, 1, 2}}, where the random rule r_ik(p): s_i → α_k(p) means that the AOC in state s_i ∈ S performs operation α_k ∈ Ω with probability p = p_ik = p(α_k | s_i), randomly select an operation α(t) ∈ Ω given the current AOC state s(t) ∈ S;
(3) Operation execution: at time t the AOC is in state s(t) ∈ S and performs the operation α(t) ∈ Ω selected in the previous step; the current state makes a transition according to δ: S(t) × Ω(t) → S(t+1), specifically:
δ(s0×α0)=s0 δ(s0×α1)=s1 δ(s0×α2)=s1
δ(s1×α0)=s0 δ(s1×α1)=s2 δ(s1×α2)=s2
δ(s2×α0)=s1 δ(s2×α1)=s2 δ(s2×α2)=s2
(4) State observation: according to the state transition function of the AOC, δ: S(t) × Ω(t) → S(t+1); whether the transition process is known or unknown, its result is fully observable, i.e. there exists j ∈ {0, 1, 2} such that s(t+1) = s_j;
(5) Operant conditioning: performing the operation at time t not only shifts the state of the AOC but also changes the probability of performing each operation at the next moment; according to the operant conditioning learning law η, the implementation probability p ∈ P of the operation rule r_ik(p) ∈ Γ is adjusted: if at time t s(t) = s_i and α(t) = α_k, the operation probabilities at time t+1 are updated as p_ik(t+1) = p_ik(t) + Δ and p_iu(t+1) = p_iu(t) − Δξ (u ≠ k), where 0 ≤ p_ik + Δ ≤ 1;
ε⃗_ij = ε(s_j) − ε(s_i)
is the increment of the orientation value, and a is the learning rate;
ξ = p_iu(t) / Σ_{v≠k} p_iv(t);
(6) Operating entropy calculation: compute the operating entropy ψ(t) at time t according to the defined operating-entropy formula;
(7) Recursion: if t + 1 ≤ T_f, set t = t + 1 and repeat steps (2)-(7);
(8) When t + 1 > T_f, stop.
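For reference, the transition and orientation tables of the machine-pigeon embodiment can be written as plain Python dictionaries, as in the sketch below. Since the text specifies the orientation increments per transition rather than per state, a run loop would look up ε⃗ by the pair (s(t), s(t+1)) instead of differencing state orientation values; the dictionary names and operation labels are illustrative assumptions.

```python
# Machine-pigeon embodiment: 3 operations (peck red / yellow / blue), 3 hunger states.
states = ["s0", "s1", "s2"]                  # non-hungry, half-hungry, hungry
ops = ["red", "yellow", "blue"]              # alpha_0, alpha_1, alpha_2
delta = {("s0", "red"): "s0", ("s0", "yellow"): "s1", ("s0", "blue"): "s1",
         ("s1", "red"): "s0", ("s1", "yellow"): "s2", ("s1", "blue"): "s2",
         ("s2", "red"): "s1", ("s2", "yellow"): "s2", ("s2", "blue"): "s2"}
# Orientation increments are given per transition in the text, not per state:
eps_inc = {("s0", "s0"): 0.0, ("s0", "s1"): 0.0,
           ("s1", "s0"): 0.5, ("s1", "s2"): -0.5,
           ("s2", "s1"): 1.0, ("s2", "s2"): -1.0}
# A run loop would use eps_inc[(s, s_next)] in place of eps[s_next] - eps[s],
# with Delta = a * eps_inc * (1 - p_0), learning rate a = 0.01 and T_f = 5000 as stated above.
```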
Embodiment 3: balance control of a two-wheeled self-balancing robot realized by the autonomous operant conditioning automaton. The two-wheeled upright robot can move freely left and right on flat ground; when the deflection angle exceeds ±12° the robot loses its balance. The state set of the AOC designed for this purpose is based on the robot's deflection angle and comprises 6 states: θ = 0°, 0° < θ < 12°, θ = 12°, −12° < θ < 0°, θ = −12° and |θ| > 12°, denoted s_0, s_1, s_2, s_3, s_4 and s_5 respectively; thus the state set is S = {s_0, s_1, s_2, s_3, s_4, s_5}. Its operation set Ω = {α_0, α_1, α_2} comprises not moving, α_0, moving to the right, α_1, and moving to the left, α_2. The state transition rules are as follows:
δ(s0×α0)=s0 δ(s0×α1)=s3 δ(s0×α2)=s1
δ(s1×α0)=s2 δ(s1×α1)=s0 δ(s1×α2)=s2
δ(s2×α0)=s5 δ(s2×α1)=s1 δ(s2×α2)=s5
δ(s3×α0)=s4 δ(s3×α1)=s4 δ(s3×α2)=s0
δ(s4×α0)=s5 δ(s4×α1)=s5 δ(s4×α2)=s3
see table 2. Its orientation function: epsilon: s → E ═ εi|i=0,±0.5,±1},εi=ε(si) E is as state siE.g. orientation value of S, at the same time
Δ = a × ε⃗_ij × (1 − p_ik).
The orientations are: s_0 → s_0: positive orientation (ε⃗_00 = 0); s_0 → s_3: zero orientation (ε⃗_03 = 0); s_0 → s_1: zero orientation (ε⃗_01 = 0); s_1 → s_0: positive orientation (ε⃗_10 = 1.0); s_1 → s_2: negative orientation (ε⃗_12 = −0.5); s_2 → s_1: positive orientation (ε⃗_21 = 1.0); s_2 → s_5: negative orientation (ε⃗_25 = −1.0); s_3 → s_4: negative orientation (ε⃗_34 = −0.5); s_3 → s_0: positive orientation (ε⃗_30 = 1.0); s_4 → s_5: negative orientation (ε⃗_45 = −1.0); s_4 → s_3: positive orientation (ε⃗_43 = 1.0).
where p_ik denotes the probability that the robot performs operation α_k in state s_i. The probabilities are continuously updated according to the operant conditioning learning law η. The initial probability of each operation is 1/3; after about 1500 learning steps the robot selects the good operation in each state with probability close to 1 and keeps its balance, and in the first five states it generally selects the operation that drives θ toward 0°, as can be seen from FIGs. 7-11. During the experiment, the operating entropy at each moment is calculated according to the defined operating-entropy formula (see FIG. 12).
TABLE 2 State transition and orientation mechanism for two-wheeled self-balancing robot
The specific implementation steps of the experiment are as follows:
(1) Initialization: set t = 0, randomly choose the initial state s(0) of the AOC, set the learning rate a = 0.01, and set the initial operation probabilities p_ik(0) = 1/3 (i = 0, 1, 2, 3, 4; k = 0, 1, 2); set the stopping time T_f = 1500;
(2) Operation selection: according to the "condition-operation" rule set Γ = {r_ik(p) | p ∈ P; i ∈ {0, 1, 2, 3, 4}; k ∈ {0, 1, 2}}, where the random rule r_ik(p): s_i → α_k(p) means that the AOC in state s_i ∈ S performs operation α_k ∈ Ω with probability p = p_ik = p(α_k | s_i), randomly select an operation α(t) ∈ Ω given the current AOC state s(t) ∈ S;
(3) Operation execution: at time t the AOC is in state s(t) ∈ S and performs the operation α(t) ∈ Ω selected in the previous step; the current state makes a transition according to δ: S(t) × Ω(t) → S(t+1), specifically:
δ(s0×α0)=s0 δ(s0×α1)=s3 δ(s0×α2)=s1
δ(s1×α0)=s2 δ(s1×α1)=s0 δ(s1×α2)=s2
δ(s2×α0)=s5 δ(s2×α1)=s1 δ(s2×α2)=s5
δ(s3×α0)=s4 δ(s3×α1)=s4 δ(s3×α2)=s0
δ(s4×α0)=s5 δ(s4×α1)=s5 δ(s4×α2)=s3
a transition occurs;
(4) State observation: according to the state transition function of the AOC, δ: S(t) × Ω(t) → S(t+1); whether the transition process is known or unknown, its result is fully observable, i.e. there exists j ∈ {0, 1, 2, 3, 4} such that s(t+1) = s_j;
(5) Operant conditioning: performing the operation at time t not only shifts the state of the AOC but also changes the probability of performing each operation at the next moment; according to the operant conditioning learning law η, the implementation probability p ∈ P of the operation rule r_ik(p) ∈ Γ is adjusted: if at time t s(t) = s_i and α(t) = α_k, the operation probabilities at time t+1 are updated as p_ik(t+1) = p_ik(t) + Δ and p_iu(t+1) = p_iu(t) − Δξ (u ≠ k), where 0 ≤ p_ik + Δ ≤ 1;
ε⃗_ij = ε(s_j) − ε(s_i)
is the increment of the orientation value, and a is the learning rate;
ξ = p_iu(t) / Σ_{v≠k} p_iv(t);
Here the optimal operation differs from state to state, so the probability of each operation in each state is maintained separately, giving 15 probabilities in total (5 states × 3 operations).
(6) Operating entropy calculation: compute the operating entropy ψ(t) at time t according to the defined operating-entropy formula;
(7) Recursion: if t + 1 ≤ T_f, set t = t + 1 and repeat steps (2)-(7);
(8) When t + 1 > T_f, stop.
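The same illustrative dictionary layout applies to the two-wheeled self-balancing robot embodiment; the sketch below transcribes the transition rules and orientation increments listed above, with the operation labels `stay`, `right` and `left` chosen here only for readability.

```python
# Two-wheeled self-balancing robot: deflection-angle states s0..s5, 3 operations.
states = ["s0", "s1", "s2", "s3", "s4", "s5"]   # 0, (0,12), 12, (-12,0), -12 degrees, |theta|>12
ops = ["stay", "right", "left"]                 # alpha_0, alpha_1, alpha_2
delta = {("s0", "stay"): "s0", ("s0", "right"): "s3", ("s0", "left"): "s1",
         ("s1", "stay"): "s2", ("s1", "right"): "s0", ("s1", "left"): "s2",
         ("s2", "stay"): "s5", ("s2", "right"): "s1", ("s2", "left"): "s5",
         ("s3", "stay"): "s4", ("s3", "right"): "s4", ("s3", "left"): "s0",
         ("s4", "stay"): "s5", ("s4", "right"): "s5", ("s4", "left"): "s3"}
eps_inc = {("s0", "s0"): 0.0, ("s0", "s3"): 0.0, ("s0", "s1"): 0.0,
           ("s1", "s0"): 1.0, ("s1", "s2"): -0.5,
           ("s2", "s1"): 1.0, ("s2", "s5"): -1.0,
           ("s3", "s4"): -0.5, ("s3", "s0"): 1.0,
           ("s4", "s5"): -1.0, ("s4", "s3"): 1.0}
# 5 states with outgoing rules x 3 operations = 15 probabilities p_ik, each updated with
# Delta = a * eps_inc * (1 - p_ik), learning rate a = 0.01, stopping time T_f = 1500.
```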
Claims (2)
1. An autonomous operant conditioning automaton, AOC for short, which is a nine-tuple AOC = &lt;t, Ω, S, Γ, δ, ε, η, ψ, s_0&gt;,
wherein
(1) Discrete time of the AOC: t ∈ {0, 1, 2, …, n_t}; t = 0 is the starting time of the AOC;
(2) Operation symbol set of the AOC: Ω = {α_k | k = 1, 2, …, n_Ω}; α_k is the kth operation symbol of the AOC;
(3) State set of the AOC: S = {s_i | i = 0, 1, 2, …, n_S}; s_i is the ith state of the AOC;
(4) Operation rule set of the AOC: Γ = {r_ik(p) | p ∈ P; i ∈ {0, 1, 2, …, n_S}; k ∈ {1, 2, …, n_Ω}}; the random "condition-operation" rule r_ik(p): s_i → α_k(p) means that the AOC, under the condition that it is in state s_i ∈ S, performs operation α_k ∈ Ω with probability p ∈ P, p = p_ik = p(α_k | s_i), i.e. the probability that the AOC performs operation α_k given that it is in state s_i; P denotes the set of the p_ik;
(5) State transition function of the AOC: δ: S(t) × Ω(t) → S(t+1); the state s(t+1) ∈ S of the AOC at time t+1 is determined by the state s(t) ∈ S at time t and the operation α(t) ∈ Ω at time t, and is independent of the states and operations before time t; the state transition process determined by δ may be known or unknown, but the result of the state transition is observable;
(6) Orientation function of the AOC: ε: S → E = {ε_i | i = 0, 1, 2, …, n_S}; ε_i = ε(s_i) ∈ E is the orientation value of state s_i ∈ S;
(7) Operant conditioning learning law of the AOC: η simulates the biological operant conditioning mechanism and adjusts the implementation probability p ∈ P of the operation rule r_ik(p) ∈ Γ: assuming the state at time t is s(t) = s_i, an operation α(t) = α_k ∈ Ω is performed, and the state observed at time t+1 is s(t+1) = s_j, then the operation probabilities at time t+1 are updated as p_ik(t+1) = p_ik(t) + Δ and p_iu(t+1) = p_iu(t) − Δξ (u ≠ k); where p_ik(t) is the probability at time t that the AOC performs operation α_k ∈ Ω under the condition that its state is s_i ∈ S, p_ik(t+1) is the same probability at time t+1, and 0 ≤ p_ik + Δ ≤ 1;
ε⃗_ij = ε(s_j) − ε(s_i)
is the increment of the orientation value; Δ is obtained from ε⃗_ij through a monotonically increasing function that equals 0 if and only if its argument is 0; a is the learning rate;
ξ = p_iu(t) / Σ_{v≠k} p_iv(t),
where u denotes any value from 1 to n_Ω other than k; Σ_{v≠k} p_iv(t) is the sum at time t of the probabilities of performing, under the condition that the AOC state is s_i ∈ S, the operations other than α_k, v ranging over all values from 1 to n_Ω other than k; p_iu(t) is the probability at time t that the AOC performs operation α_u ∈ Ω given state s_i ∈ S, and p_iu(t+1) is the same probability at time t+1;
(8) Operating entropy of the AOC: ψ: P × E → R⁺, where R⁺ is the set of positive real numbers. The operating entropy ψ_i(t) of the AOC under the condition that its state at time t is s_i is determined by the operation probability set and the orientation function set under the condition s(t) = s_i: ψ_i(t) = −Σ_{k=1}^{n_Ω} p_ik(t) log₂ p_ik(t); knowing the operating entropy of each state and weighting and summing them gives the operating entropy of the AOC at time t:
ψ(t) = −Σ_{i=0}^{n_S} p_i Σ_{k=1}^{n_Ω} p_ik log₂ p_ik = −Σ_{i=0}^{n_S} p(s_i) Σ_{k=1}^{n_Ω} p(α_k|s_i) log₂ p(α_k|s_i);
where p(s_i) is the probability at time t that the AOC state s_i ∈ S occurs, and p(α_k | s_i) is the probability at time t that the AOC performs operation α_k ∈ Ω under the condition that its state is s_i ∈ S;
(9) initial state of AOC: s0=s(0)∈S。
2. The autonomous operant conditioning automaton AOC according to claim 1, characterized in that it runs recursively according to the following steps:
(1) Initialization: set t = 0, randomly choose the initial state s(0) of the AOC, set the learning rate a, and set the initial operation probabilities p_ik(0) = 1/n_Ω (i = 0, 1, 2, …, n_S; k = 1, 2, …, n_Ω); set the stopping time T_f;
(2) Operation selection: according to the rules r_ik(p): s_i → α_k(p) of the "condition-operation" rule set Γ, i.e. the AOC in state s_i ∈ S performs operation α_k ∈ Ω with probability p ∈ P, p = p_ik = p(α_k | s_i), randomly select an operation α(t) ∈ Ω given the current AOC state s(t) ∈ S;
(3) Operation execution: at time t the AOC is in state s(t) ∈ S and performs the operation α(t) ∈ Ω selected in the previous step; the current state makes a transition according to δ(s(t), α(t)) = δ(s_i, α_k);
(4) State observation: according to the state transition function of the AOC, δ: S(t) × Ω(t) → S(t+1), the result of the state transition is fully observable, i.e. there exists j ∈ {0, 1, 2, …, n_S} such that s(t+1) = s_j;
(5) Operant conditioning: performing the operation at time t not only shifts the state of the AOC but also changes the probability of performing each operation at the next moment; according to the operant conditioning learning law η, the implementation probability p ∈ P of the operation rule r_ik(p) ∈ Γ is adjusted: if at time t s(t) = s_i and α(t) = α_k, the operation probabilities at time t+1 are updated as p_ik(t+1) = p_ik(t) + Δ and p_iu(t+1) = p_iu(t) − Δξ (u ≠ k), where 0 ≤ p_ik + Δ ≤ 1;
ε⃗_ij = ε(s_j) − ε(s_i)
is the increment of the orientation value, and a is the learning rate;
ξ = p_iu(t) / Σ_{v≠k} p_iv(t);
(6) Operating entropy calculation: according to the defined operating-entropy formula
ψ(t) = −Σ_{i=0}^{n_S} p_i Σ_{k=1}^{n_Ω} p_ik log₂ p_ik = −Σ_{i=0}^{n_S} p(s_i) Σ_{k=1}^{n_Ω} p(α_k|s_i) log₂ p(α_k|s_i),
the operating entropy at time t is calculated, where p(s_i) is the probability at time t that the AOC state s_i ∈ S occurs, and p(α_k | s_i) is the probability at time t that the AOC performs operation α_k ∈ Ω under the condition that its state is s_i ∈ S;
(7) Recursion: if t + 1 ≤ T_f, set t = t + 1 and repeat steps (2)-(7);
(8) When t + 1 > T_f, stop.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNA2009100892633A CN101599137A (en) | 2009-07-15 | 2009-07-15 | Autonomous operant conditioning reflex automat and the application in realizing intelligent behavior |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101599137A true CN101599137A (en) | 2009-12-09 |
Family
ID=41420574
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNA2009100892633A Pending CN101599137A (en) | 2009-07-15 | 2009-07-15 | Autonomous operant conditioning reflex automat and the application in realizing intelligent behavior |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101599137A (en) |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103792846A (en) * | 2014-02-18 | 2014-05-14 | 北京工业大学 | Robot obstacle avoidance guiding method based on Skinner operating condition reflection principle |
CN103792846B (en) * | 2014-02-18 | 2016-05-18 | 北京工业大学 | Based on the robot obstacle-avoiding air navigation aid of Skinner operant conditioning reflex principle |
CN105094124A (en) * | 2014-05-21 | 2015-11-25 | 防灾科技学院 | Method and model for performing independent path exploration based on operant conditioning |
CN104614988B (en) * | 2014-12-22 | 2017-04-19 | 北京工业大学 | Cognitive and learning method of cognitive moving system with inner engine |
CN104614988A (en) * | 2014-12-22 | 2015-05-13 | 北京工业大学 | Cognitive and learning method of cognitive moving system with inner engine |
CN104570738A (en) * | 2014-12-30 | 2015-04-29 | 北京工业大学 | Robot track tracing method based on Skinner operant conditioning automata |
CN105205533A (en) * | 2015-09-29 | 2015-12-30 | 华北理工大学 | Development automatic machine with brain cognition mechanism and learning method of development automatic machine |
CN105205533B (en) * | 2015-09-29 | 2018-01-05 | 华北理工大学 | Development automatic machine and its learning method with brain Mechanism of Cognition |
WO2017114130A1 (en) * | 2015-12-31 | 2017-07-06 | 深圳光启合众科技有限公司 | Method and device for obtaining state of robot |
CN106926236A (en) * | 2015-12-31 | 2017-07-07 | 深圳光启合众科技有限公司 | The method and apparatus for obtaining the state of robot |
CN106926236B (en) * | 2015-12-31 | 2020-06-30 | 深圳光启合众科技有限公司 | Method and device for acquiring state of robot |
CN108846477A (en) * | 2018-06-28 | 2018-11-20 | 上海浦东发展银行股份有限公司信用卡中心 | A kind of wisdom brain decision system and decision-making technique based on reflex arc |
CN108846477B (en) * | 2018-06-28 | 2022-06-21 | 上海浦东发展银行股份有限公司信用卡中心 | Intelligent brain decision system and decision method based on reflection arcs |
CN109212975A (en) * | 2018-11-13 | 2019-01-15 | 北方工业大学 | A kind of perception action cognitive learning method with developmental mechanism |
CN111464707A (en) * | 2020-03-30 | 2020-07-28 | 中国建设银行股份有限公司 | Outbound call processing method, device and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20091209 |