CN105205533B - Developmental automaton with a brain cognitive mechanism and its learning method

Info

Publication number: CN105205533B
Application number: CN201510628233.0A
Authority: CN
Inventors: 任红格; 史涛; 向迎帆; 李福进; 李冬梅; 霍美杰; 徐少彬; 刘为民; 张春磊; 尹瑞
Original assignee: North China University of Science and Technology
Current assignee: North China University of Science and Technology
Priority date: 2015-09-29
Filing date: 2015-09-29
Publication date: 2018-01-05
Anticipated expiration: 2035-09-29
Also published as: CN105205533A

Abstract

The present invention relates to a developmental automaton with a brain cognitive mechanism and its learning method, and belongs to the technical field of intelligent robots. The developmental automaton with a brain cognitive mechanism comprises an internal state set, a system output set, an internal operation behavior set, a state transition equation, a reward signal, a system evaluation function, a system action selection probability, and a dopamine response difference signal. Built on the learning automaton framework, the developmental automaton and its learning method provide a mathematical model of the system's autonomous development process that generalizes well and applies widely. The method combines the sensorimotor system with an intrinsic motivation mechanism, which improves the system's self-learning and adaptive abilities and realizes intelligence in the true sense.

Description

Developmental automaton with a brain cognitive mechanism and its learning method

Technical field

The present invention relates to a developmental automaton with a brain cognitive mechanism and its learning method, and belongs to the technical field of intelligent robots.

Background art

Learning and memory are the essence of intelligent behavior in humans and animals. The various skills of humans and animals gradually form and develop as their nervous systems learn and organize themselves. Studying and simulating the neural activity and self-regulation mechanisms of humans and animals, and endowing intelligent robots with them, is an important research topic in artificial intelligence and control science.

In 1996, J. Weng was the first to propose the idea of autonomous mental development for robots. He argued that an intelligent agent should, on the basis of simulating the human brain and under the control of an intrinsic developmental program, develop its intelligence by interacting with unknown environments through sensors and effectors. Brooks et al. emphasized that a robot gradually develops its intelligence through interactive learning with teachers and the environment. Drawing on neuroscience research, they proposed computational models that simulate regions such as the prefrontal lobe of the cerebral cortex, the hypothalamus, and the hippocampus of humans and animals in order to handle challenges in complex environments, which also involves the sensorimotor system. Initial cognitive development begins with the formation and development of the coordination mechanism of the sensorimotor system, while the sensorimotor system in turn is continuously coordinated and perfected as intrinsic motivation forms and develops. The neuroscience literature shows that, when humans and animals learn, the cerebral cortex, the basal ganglia, and the cerebellum operate together, each in its own characteristic way. In the motor-related pathways of humans and animals, the cerebellum and the basal ganglia lie on either side of the route that carries motor signals from the cerebral cortex to the spinal cord, and they can take part in initiating and controlling any behavioral action.

Among related patents, the invention patent with application number CN200910086990.4 proposes an operation automaton model based on automata theory and applies the model to autonomous learning control of robots. The patent with application number CN201310656943.5 applies the operant conditioning principle to image processing, effectively improving the precision and speed with which the system processes images. The patent with application number 201410101272.0 proposes a bionic intelligent control method that addresses the low learning efficiency and poor adaptability of traditional robots, effectively raising the intelligence level of robots. Application number 201410163756.8 proposes an autonomous mental development cloud robot system based on cloud computing; the system effectively reduces the burden on robots of executing compute-intensive tasks and also allows knowledge to be shared among different robots. However, none of the above patents involves a learning system that simulates the cognitive mechanism of the human brain.

Summary of the invention

To address the above technical problem, the present invention takes the biological sensorimotor system as its theoretical foundation and introduces the intrinsic motivation mechanism from psychology to drive learning. It provides a developmental automaton with a brain cognitive mechanism and its learning method, which improve the autonomous developmental cognitive ability of robots.

The developmental automaton with a brain cognitive mechanism comprises an internal state set, a system output set, an internal operation behavior set, a state transition equation, a reward signal, a system evaluation function, a system action selection probability, and a dopamine response difference signal:

(1) SC = [s₁, s₂, ..., s_j] denotes the finite internal state set, which corresponds to the sensory cortex of the cerebral cortex; s_j denotes the j-th state, and j is the number of internal states.

(2) MC = [y₁, y₂, ..., y_i] denotes the system output set, which corresponds to the motor cortex of the cerebral cortex; y_i denotes the i-th output, and i is the number of outputs.

(3) Cb_A = [a₁, a₂, ..., a_k] denotes the internal operation behavior set, which corresponds to the cerebellum; a_k is the k-th internal action, and k is the number of internal actions.

(4) f: s(t) × a(t) → s(t+1) is the state transition equation: the state s(t+1) at time t+1 is jointly determined by the state s(t) and the operation behavior a(t) at time t, and is generally determined by the environment or a model.

(5) r(t) = r(s(t), a(t)) denotes the reward signal obtained when the system, in internal state s(t) at time t, takes the internal operation behavior a(t) and the state transfers to s(t+1); it corresponds to the thalamic sensation signal sent by the thalamus.

(6) The input signal from the cerebral cortex consists of two parts, the sensory cortex information and the motor cortex information, which together serve as the input of the striatum. Therefore:

$$CC = \{SC, MC\} \tag{1}$$

The striatum mainly serves as the evaluation mechanism that predicts how good or bad the organism's orientation is; more precisely, it is the evaluation mechanism for the orientation produced by the intrinsic motivation mechanism. The system evaluation function is defined as follows:

$$BG_{strio}(t) = r(t+1) + \gamma\, r(t+2) + \gamma^{2} r(t+3) + \dots \tag{2}$$

where γ ∈ [0,1] is the discount factor. Because of the intrinsic motivation mechanism, the system evaluation function BG_strio gradually approaches 0, which ensures that the system finally reaches a stable state. Define η as the orientation core of the intrinsic motivation mechanism; its main function is to guide the direction of autonomous cognition. The value of the orientation core η lies in [η_min, η_max], that is, between the function values of the best and the worst orientation. The intrinsic motivation orientation function in the striatum is then defined by formula (3):

$$\eta(t) = \frac{1 - e^{-\lambda BG_{strio}(t)}}{1 + e^{-\lambda BG_{strio}(t)}} \tag{3}$$

where λ is the parameter of the orientation function. The difference between the orientation function values at two adjacent moments, θ(t) = η(t) - η(t-1), is used to judge the system's degree of orientation: if θ(t) > 0, the orientation value at time t is larger than at time t-1; conversely, if θ(t) < 0, the orientation value at time t is smaller than at time t-1.

(7) In the learning process of the basal ganglia, the matrix of the striatum mainly performs action selection. The most important feature of learning driven by the intrinsic motivation mechanism is that the action to execute is selected according to its probability. The Boltzmann probability rule, which is well known, is used to realize the action selection function of the matrix and thereby the probabilistic selection mechanism of the learning automaton. First define:

$$A = \mathrm{Boltz}_T\{E(s, a_k),\ k = 1, 2, \dots, m\} \iff p(a = a_k) = \frac{e^{E(s, a_k)/T}}{\sum_{k=1}^{m} e^{E(s, a_k)/T}} \tag{4}$$

where m is the total number of internal actions over which the sum runs, A denotes the Boltzmann probability rule, and p(a = a_k) denotes the action selection probability.

Following the definition in formula (4), p(a = a_k) is replaced by the system action selection probability output BG_matrix(s, a) of the striatal matrix, and substituting formula (2) into formula (4) gives formula (5):

$$BG_{matrix}(s, a) = \frac{e^{BG_{strio}(SC(t), a_k)/T}}{\sum_{k=1}^{m} e^{BG_{strio}(SC(t), a_k)/T}} \tag{5}$$

where T is the temperature constant, which expresses how random the action selection is: the larger T is, the more random the selection, and the smaller T is, the less random. As T gradually approaches zero, the selection probability of the action with the largest BG_strio(SC(t), a_k) gradually tends to 1. The value of T in the system decreases gradually over time, which means that the system's empirical knowledge grows during learning and the system gradually evolves from an unstable system into a stable one.

(8) The dopamine released by the substantia nigra pars compacta serves as the guidance signal for action evaluation. It is used to improve behavioral performance toward the maximum future reward produced by actions, so that more accurate actions are executed. The evaluation function determined by the striatum at time t+1 is:

$$BG_{strio}(t+1) = r(t+2) + \gamma\, r(t+3) + \gamma^{2} r(t+4) + \dots \tag{6}$$

Combining formula (2) and formula (6) gives formula (7):

$$BG_{strio}(t) = r(t+1) + \gamma\, BG_{strio}(t+1) \tag{7}$$

This shows that at time t the evaluation function BG_strio(t) can be expressed through the evaluation function BG_strio(t+1) at time t+1. However, because of the error present in the early stage of prediction, the value of BG_strio(t) expressed through BG_strio(t+1) is not equal to the actual value. The reward information output by the thalamus and the output of the striatum therefore need to be processed in the substantia nigra pars compacta, which releases dopamine SN_DPA to adjust the evaluation value. The dopamine response difference signal is given by formula (8):

$$SN_{DPA} = r(t+1) + \gamma\, BG_{strio}(t+1) - BG_{strio}(t) \tag{8}$$
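The quantities in formulas (2), (3), (5), and (8) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the patented implementation: the function names, the truncation of the infinite sum in formula (2) to a finite reward list, and the default λ = 1.0 are all introduced here for illustration. The orientation function uses the identity (1 - e^{-x}) / (1 + e^{-x}) = tanh(x/2), which is algebraically equal to formula (3) and avoids overflow for large negative arguments.

```python
import math


def evaluation(rewards, gamma=0.9):
    """Formula (2), truncated: rewards[0] is r(t+1), rewards[1] is r(t+2), ..."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total


def orientation(bg, lam=1.0):
    """Formula (3), written as tanh(lam * bg / 2); lam = 1.0 is an assumed default."""
    return math.tanh(0.5 * lam * bg)


def boltzmann(values, temperature):
    """Formula (5): Boltzmann probabilities over evaluation values, temperature T > 0."""
    top = max(values)  # subtracting the maximum leaves the ratios unchanged
    exps = [math.exp((v - top) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dopamine_signal(reward_next, bg_next, bg_now, gamma=0.9):
    """Formula (8): SN_DPA = r(t+1) + gamma * BG_strio(t+1) - BG_strio(t)."""
    return reward_next + gamma * bg_next - bg_now
```

When the evaluation values satisfy formula (7) exactly, `dopamine_signal` returns zero; a nonzero value measures the prediction error that the text says the dopamine response is meant to correct.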

The learning method of the developmental automaton with a brain cognitive mechanism comprises the following steps:

(1) Initialization: set the initial iterative learning step t = 0 and the number of iterative learning steps to step_max, and initialize the parameters and synaptic weights, so that at the start of the experiment all initial internal operation behaviors have the same probability of being executed;

(2) Perceive the current state SC(t);

(3) Calculate the evaluation function BG_strio(t) in the striatum; because of the intrinsic motivation mechanism, calculate the orientation function η(t) from the current value of BG_strio(t);

(4) According to the quality of the orientation, calculate the action selection probability BG_matrix(s, a) of the striatal matrix by formula (5), and have the cerebellum execute the action a(t);

(5) According to the state transition equation, the state changes from SC(t) to SC(t+1);

(6) The thalamus sends the immediate reward r(t) and triggers the dopamine response to adjust the evaluation value;

(7) The motor cortex of the brain outputs the action y(t);

(8) Repeat steps (2) to (7) until t = step_max, at which point learning ends.
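The eight steps above can be sketched as a small learning loop. The sketch below is hypothetical and fills several gaps with assumptions that the text does not specify: a toy five-state chain stands in for the robot, evaluation values are stored in a table indexed by state and action, the dopamine signal adjusts the table through a Q-learning-style update with an assumed learning rate `ALPHA`, the temperature follows an assumed linear schedule, and the state evaluation BG_strio(t) is taken as the largest table entry for the current state. Because the text does not say how η(t) enters the selection formula, the sketch only records the orientation difference θ(t).

```python
import math
import random

N_STATES = 5                 # toy chain of states; state 2 plays the "balanced" role
ACTIONS = (-1, 0, 1)         # hypothetical internal actions: left, stay, right
GAMMA, LAMBDA, ALPHA = 0.9, 1.0, 0.2   # discount, orientation parameter, assumed learning rate


def transition(s, a):
    """State transition f: s(t) x a(t) -> s(t+1), clipped to the chain."""
    return min(N_STATES - 1, max(0, s + a))


def reward(s_next):
    """Assumed reward: 0 at the balanced state, negative elsewhere."""
    return -float(abs(s_next - 2))


def boltzmann(values, temperature):
    """Formula (5): Boltzmann probabilities over the evaluation values."""
    top = max(values)
    exps = [math.exp((v - top) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def learn(step_max=3000, seed=0):
    rng = random.Random(seed)
    # Step (1): equal initial values give equal initial action probabilities.
    bg = [[0.0] * len(ACTIONS) for _ in range(N_STATES)]
    s, eta_prev, history = 0, 0.0, []
    for t in range(step_max):                               # step (8): repeat
        temperature = max(0.05, 1.0 - t / step_max)         # assumed schedule, T decreases
        # Step (2) is implicit: s holds the perceived state SC(t).
        eta = math.tanh(0.5 * LAMBDA * max(bg[s]))          # step (3), formula (3)
        theta = eta - eta_prev                              # change in orientation
        probs = boltzmann(bg[s], temperature)               # step (4)
        k = rng.choices(range(len(ACTIONS)), weights=probs)[0]
        s_next = transition(s, ACTIONS[k])                  # step (5)
        r = reward(s_next)                                  # step (6): reward ...
        delta = r + GAMMA * max(bg[s_next]) - bg[s][k]      # ... and formula (8)
        bg[s][k] += ALPHA * delta                           # assumed adjustment rule
        y = ACTIONS[k]                                      # step (7): output action
        history.append((s_next, y, theta, delta))
        s, eta_prev = s_next, eta
    return bg, history


if __name__ == "__main__":
    values, trace = learn()
    print("evaluation values at the balanced state:", values[2])
```

With rewards that are never positive, every table entry stays at or below zero and the entry for staying in the balanced state remains zero, which mirrors the statement that the evaluation function approaches 0 once the system is stable.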

Compared with the prior art, the developmental automaton with a brain cognitive mechanism and its learning method provided by the present invention, first, build on the learning automaton framework to provide a mathematical model of the system's autonomous development process that generalizes well and applies widely; second, they combine the sensorimotor system with the intrinsic motivation mechanism, which improves the system's self-learning and adaptive abilities and realizes intelligence in the true sense.

Brief description of the drawings

Fig. 1 is a structural diagram of the system of the present invention;

Fig. 2 is a flow chart of the learning process of the present invention;

Fig. 3 shows the response curves of each state in the balance control of the coaxial two-wheeled robot of the embodiment;

Fig. 4 shows the evaluation function and error simulation curves in the balance control of the coaxial two-wheeled robot of the embodiment;

Fig. 5 shows the simulation results of the anti-interference experiment of the embodiment;

Fig. 6 compares the evaluation function curves of the learning method of the embodiment and the traditional learning automaton method;

Fig. 7 compares the error curves of the learning method of the embodiment and the traditional learning automaton method.

Detailed description of the embodiment

The invention is further described below with reference to the accompanying drawings and a specific embodiment.

A coaxial two-wheeled robot is taken as the embodiment. The system structure is shown in Fig. 1, and learning follows the step flow shown in Fig. 2.

A nonholonomic two-wheeled self-balancing robot is an inherently unstable system. Before any motion can be realized, the robot must first be able to keep its balance, so posture balance is the primary condition for motion control of the coaxial two-wheeled robot. To verify the validity, robustness, and superiority of the developmental automaton with a brain cognitive mechanism proposed by the present invention, this embodiment takes the coaxial two-wheeled robot as its object and studies how the robot, in an unknown environment, finally learns the balance skill through autonomous learning.

During the experiment, four output quantities of the robot must satisfy the corresponding conditions: the angular velocities of the right and left wheels, θ_r and θ_l, are below 3.489 rad/s, the inclination of the body satisfies α < 0.1744 rad, and the angular velocity of the robot's swing rod satisfies β < 3.489 rad/s. The discount factor is γ = 0.9 and the sampling time is 0.01 s. In each experiment, learning is stopped and another experiment is started when the number of trials exceeds 1000 or when the number of balanced steps in a single trial exceeds 20000. If the robot still keeps its balance after 20000 steps in one trial, it is considered to have learned the balance control skill. After each failed trial, the initial state and the weights are reset to random values within a certain range, and learning starts again.
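The balance conditions and stopping rules in the paragraph above can be written down as a short sketch. The names `RobotState`, `is_balanced`, and `run_experiment` are hypothetical, and comparing magnitudes with `abs` is an assumption, since the text states only upper limits; `run_trial` stands for any user-supplied routine that returns the number of balanced steps in one trial.

```python
from dataclasses import dataclass

MAX_WHEEL_RATE = 3.489    # rad/s, limit for both wheel angular velocities
MAX_TILT = 0.1744         # rad, limit for the body inclination
MAX_ROD_RATE = 3.489      # rad/s, limit for the swing-rod angular velocity
MAX_TRIALS = 1000         # trials allowed per experiment
SUCCESS_STEPS = 20000     # balanced steps that count as having learned the skill
SAMPLE_TIME = 0.01        # seconds per step


@dataclass
class RobotState:
    wheel_rate_right: float
    wheel_rate_left: float
    tilt: float
    rod_rate: float


def is_balanced(state):
    """True while all four quantities stay inside their limits (magnitudes assumed)."""
    return (abs(state.wheel_rate_right) < MAX_WHEEL_RATE
            and abs(state.wheel_rate_left) < MAX_WHEEL_RATE
            and abs(state.tilt) < MAX_TILT
            and abs(state.rod_rate) < MAX_ROD_RATE)


def run_experiment(run_trial):
    """Call run_trial(trial_number) until one trial succeeds or the trial limit is hit."""
    for trial in range(1, MAX_TRIALS + 1):
        if run_trial(trial) >= SUCCESS_STEPS:
            return trial, True
    return MAX_TRIALS, False
```

At the stated sampling time of 0.01 s, the success criterion of 20000 steps corresponds to 200 s of continuous balancing.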

Experiment 1: Balance control experiment

In an unknown environment without interference, the robot learns continuously using the method proposed by the present invention. It goes through 42 exploratory trials and completes the experiment in the 43rd trial, needing only about 220 steps, i.e., about 2.2 s, to learn the balance control skill. This demonstrates its fast autonomous learning ability and the effectiveness of the present invention. The response curves of each state quantity and the evaluation function and error simulation curves for the first 3000 steps of the simulation are shown in Fig. 3 and Fig. 4.

Experiment 2: Anti-interference experiment

When the system actually runs, the input and output signals are more or less disturbed by external noise, or the detection devices are inaccurate, so the state quantities contain a certain error. To simulate the actual environment, once the robot has kept its balance for 9800 steps after learning balance control, a pulse signal with an amplitude of 25 is added to each input state quantity. If the robot withstands the pulse disturbance and keeps its balance, the experiment is considered successful and shows that the present invention has a certain robustness. Fig. 5 shows the output response of each state after the pulse signal is added; after 200 steps, i.e., about 2 s, the robot returns to the equilibrium position.
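The disturbance test described above can be sketched as two small helpers. Both are hypothetical: the text gives only the pulse amplitude (25) and the step at which it is applied (9800), so the one-sample pulse duration, the helper names, and the tolerance used to decide when the robot has recovered are assumptions made for illustration.

```python
def add_pulse(states, step, pulse_step=9800, amplitude=25.0):
    """Return a copy of the state vector with the pulse added at pulse_step only."""
    if step == pulse_step:
        return [x + amplitude for x in states]
    return list(states)


def recovery_steps(trace, pulse_step, tolerance):
    """Steps after the pulse until the signal stays within +/- tolerance."""
    last_outside = None
    for index in range(pulse_step, len(trace)):
        if abs(trace[index]) > tolerance:
            last_outside = index
    if last_outside is None:
        return 0
    return last_outside - pulse_step + 1
```

Multiplying the value returned by `recovery_steps` by the 0.01 s sampling time gives the recovery time in seconds, which is how 200 steps correspond to about 2 s in the text.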

Experiment 3: Comparison between this embodiment and the traditional learning automaton

Because the present invention introduces the intrinsic motivation mechanism to drive the robot's autonomous learning, it helps reduce the system error and increase the convergence speed of the algorithm. To demonstrate the superiority of the present invention, balance control experiments were carried out on the coaxial two-wheeled robot with the traditional learning automaton algorithm and with the present invention, and the results were analyzed. The parameters of the two algorithms were set identically. Fig. 6 and Fig. 7 compare the evaluation function and error curves of the two algorithms over the first 2000 steps. Fig. 6 shows that the present invention completes learning of the balance control skill in about 220 steps, i.e., 2.2 s, whereas the traditional learning automaton method needs about 600 steps, i.e., 6 s, which shows that the present invention converges faster than the traditional learning automaton method. Fig. 7 shows that the error range of the present invention is better than that of the traditional learning automaton method, which is more conducive to the stability of the system.
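The times quoted in this comparison follow directly from the step counts and the 0.01 s sampling time given for the embodiment. The short worked calculation below makes the conversion explicit; the subscript labels are introduced here only for readability, and the final ratio is simple arithmetic on the reported step counts rather than a figure stated in the source.

```latex
\begin{align*}
t_{\text{invention}} &\approx 220 \times 0.01\,\mathrm{s} = 2.2\,\mathrm{s},\\
t_{\text{traditional}} &\approx 600 \times 0.01\,\mathrm{s} = 6\,\mathrm{s},\\
\frac{t_{\text{traditional}}}{t_{\text{invention}}} &\approx \frac{600}{220} \approx 2.7.
\end{align*}
```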

Claims

1. A developmental automaton with a brain cognitive mechanism, characterized by comprising an internal state set, a system output set, an internal operation behavior set, a state transition equation, a reward signal, a system evaluation function, a system action selection probability, and a dopamine response difference signal;

(1) SC = [s₁, s₂, ..., s_j] denotes the finite internal state set, which corresponds to the sensory cortex of the cerebral cortex; s_j denotes the j-th state, and j is the number of internal states;

(2) MC = [y₁, y₂, ..., y_i] denotes the system output set, which corresponds to the motor cortex of the cerebral cortex; y_i denotes the i-th output, and i is the number of outputs;

(3) Cb_A = [a₁, a₂, ..., a_k] denotes the internal operation behavior set, which corresponds to the cerebellum; a_k is the k-th internal action, and k is the number of internal actions;

(4) f: s(t) × a(t) → s(t+1) is the state transition equation, i.e., the state s(t+1) at time t+1 is jointly determined by the state s(t) and the operation behavior a(t) at time t, and is determined by the environment or a model;

(5) r(t) = r(s(t), a(t)) denotes the reward signal obtained when the system, in internal state s(t) at time t, takes the internal operation behavior a(t) and the state transfers to s(t+1); it corresponds to the thalamic sensation signal sent by the thalamus;

(6) the input signal from the cerebral cortex consists of two parts, the sensory cortex information and the motor cortex information, which together serve as the input of the striatum; therefore:

$$CC = \{SC, MC\} \tag{1}$$

the striatum mainly serves as the evaluation mechanism that predicts how good or bad the organism's orientation is, and more precisely as the evaluation mechanism for the orientation of the intrinsic motivation mechanism; the system evaluation function is defined as follows:

$$BG_{strio}(t) = r(t+1) + \gamma\, r(t+2) + \gamma^{2} r(t+3) + \dots \tag{2}$$

where γ ∈ [0,1] is the discount factor; because of the intrinsic motivation mechanism, the system evaluation function BG_strio gradually approaches 0, which ensures that the system finally reaches a stable state; η is defined as the orientation core of the intrinsic motivation mechanism, whose main function is to guide the direction of autonomous cognition; the value of the orientation core η lies in [η_min, η_max], i.e., between the function values of the best and the worst orientation; the intrinsic motivation orientation function in the striatum is then defined by formula (3):

$$\eta(t) = \frac{1 - e^{-\lambda BG_{strio}(t)}}{1 + e^{-\lambda BG_{strio}(t)}} \tag{3}$$

where λ is the parameter of the orientation function; the difference between the orientation function values at two adjacent moments, θ(t) = η(t) - η(t-1), is used to judge the system's degree of orientation: if θ(t) > 0, the orientation value at time t is larger than at time t-1; conversely, if θ(t) < 0, the orientation value at time t is smaller than at time t-1;

(7) in the learning process of the basal ganglia, the matrix of the striatum mainly performs action selection; the most important feature of learning driven by the intrinsic motivation mechanism is that the action to execute is selected according to its probability; the Boltzmann probability rule, which is well known, is used to realize the action selection function of the matrix and thereby the probabilistic selection mechanism of the learning automaton; first define:

$$A = \mathrm{Boltz}_T\{E(s, a_k),\ k = 1, 2, \dots, m\} \iff p(a = a_k) = \frac{e^{E(s, a_k)/T}}{\sum_{k=1}^{m} e^{E(s, a_k)/T}} \tag{4}$$

where m is the total number of internal actions over which the sum runs, A denotes the Boltzmann probability rule, and p(a = a_k) denotes the action selection probability;

following the definition in formula (4), p(a = a_k) is replaced by the system action selection probability output BG_matrix(s, a) of the striatal matrix, and substituting formula (2) into formula (4) gives formula (5):

$$BG_{matrix}(s, a) = \frac{e^{BG_{strio}(SC(t), a_k)/T}}{\sum_{k=1}^{m} e^{BG_{strio}(SC(t), a_k)/T}} \tag{5}$$

where T is the temperature constant, which expresses how random the action selection is: the larger T is, the more random the selection, and the smaller T is, the less random; as T gradually approaches zero, the selection probability of the action corresponding to BG_strio(SC(t), a_k) gradually tends to 1; the value of T in the system decreases gradually over time, which means that the system's empirical knowledge grows during learning and the system gradually evolves from an unstable system into a stable one;

(8) the dopamine released by the substantia nigra pars compacta serves as the guidance signal for action evaluation and is used to improve behavioral performance toward the maximum future reward produced by actions, so that more accurate actions are executed; the evaluation function determined by the striatum at time t+1 is:

$$BG_{strio}(t+1) = r(t+2) + \gamma\, r(t+3) + \gamma^{2} r(t+4) + \dots \tag{6}$$

combining formula (2) and formula (6) gives formula (7):

$$BG_{strio}(t) = r(t+1) + \gamma\, BG_{strio}(t+1) \tag{7}$$

this shows that at time t the evaluation function BG_strio(t) can be expressed through the evaluation function BG_strio(t+1) at time t+1; however, because of the error present in the early stage of prediction, the value of BG_strio(t) expressed through BG_strio(t+1) is not equal to the actual value, so the reward information output by the thalamus and the output of the striatum need to be processed in the substantia nigra pars compacta, which releases dopamine SN_DPA to adjust the evaluation value; the dopamine response difference signal is given by formula (8):

$$SN_{DPA} = r(t+1) + \gamma\, BG_{strio}(t+1) - BG_{strio}(t) \tag{8}$$

2. The developmental automaton with a brain cognitive mechanism according to claim 1, characterized in that its learning method comprises the following steps:

(1) initialization: set the initial iterative learning step t = 0 and the number of iterative learning steps to step_max, and initialize the parameters and synaptic weights, so that at the start of the experiment all initial internal operation behaviors have the same probability of being executed;

(2) perceive the current state SC(t);

(3) calculate the evaluation function BG_strio(t) in the striatum; because of the intrinsic motivation mechanism, calculate the orientation function η(t) from the current value of BG_strio(t);

(4) according to the quality of the orientation, calculate the action selection probability BG_matrix(s, a) of the striatal matrix by formula (5), and have the cerebellum execute the action a(t);

(5) according to the state transition equation, the state changes from SC(t) to SC(t+1);

(6) the thalamus sends the immediate reward r(t) and triggers the dopamine response to adjust the evaluation value;

(7) the motor cortex of the brain outputs the action y(t);

(8) repeat steps (2) to (7) until t = step_max, at which point learning ends.