CN107894715A - The cognitive development method of robot pose path targetpath optimization - Google Patents


Info

Publication number
CN107894715A
CN107894715A (application CN201711117326.2A)
Authority
CN
China
Prior art keywords
strio
state
robot
orientation
equation
Prior art date
Legal status
Withdrawn
Application number
CN201711117326.2A
Other languages
Chinese (zh)
Inventor
任红格
史涛
宫海洋
李福进
李军
尹瑞
徐少彬
赵传松
杜建
王玮
Current Assignee
North China University of Science and Technology
Original Assignee
North China University of Science and Technology
Priority date
Filing date
Publication date
Application filed by North China University of Science and Technology filed Critical North China University of Science and Technology
Priority to CN201711117326.2A priority Critical patent/CN107894715A/en
Publication of CN107894715A publication Critical patent/CN107894715A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/04Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
    • G05B13/048Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators using a predictor
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/12Target-seeking control

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • General Physics & Mathematics (AREA)
  • Remote Sensing (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention provides a cognitive development algorithm (CBCLA) for robot pose path target-trajectory optimization. The algorithm is divided into eight parts: a finite internal state set, the output set of the system, a set of internal operating behaviors, a state transition equation, the internal state of the system at time t, an evaluation function, the action-selection probability output of the striatal matrix, and the dopaminergic signal. These can be represented by an eight-element tuple: CBCLA = {SC, MC, Cb_A, f, r(t), BG_strio, BG_matrix, SN_DPA}. The invention simulates the neural activity of the biological sensorimotor system, takes a learning automaton as its framework, and combines the characteristic that an intrinsic-motivation mechanism drives an organism to learn autonomously. Applying this cognitive development algorithm to mobile-robot path planning lets the robot develop through autonomous learning in an unknown environment, gradually master motion-balance control skills, and achieve real-time tracking of the target.

Description

Cognitive development method for robot posture path target track optimization
The technical field is as follows:
The invention belongs to the technical field of intelligent robots, and relates to a robot pose path target-trajectory optimization simulation method.
Technical background:
Human and animal intelligence is embodied in learning and memory: the various skills of humans and animals form slowly and develop gradually as the nervous system learns and organizes itself. Studying and simulating human and animal neural activity and self-regulation mechanisms, and endowing them to intelligent robots, is an important research subject of artificial intelligence and control science. In 1996, J. Weng first proposed the concept of autonomous mental development for robots, arguing that an agent should develop its mental ability by interacting with unknown environments through sensors and effectors under the control of an intrinsic developmental program, on the basis of a simulated human brain. Brooks et al. emphasized that interactive learning between robots, teachers, and the environment gradually develops a robot's intelligence, and, combining research theories from neuroscience, proposed computational models that simulate areas of the human and animal cerebral cortex such as the prefrontal lobes, hypothalamus, and hippocampus to handle complex problems in complex environments; this work also involves the sensorimotor system. Initial cognitive development begins with the growth of the coordination mechanisms of the sensorimotor system, which are in turn continually coordinated and refined as intrinsic motivation develops.
The neuroscience literature shows that during human and animal learning, the cerebral cortex, basal ganglia, and cerebellum work in parallel, each in its own specific way. In motor control, the cerebellum and basal ganglia sit on either side of the route along which motor signals travel between the cerebral cortex and the spinal cord, and participate in the initiation and control of every behavioral action.
Many scholars conducted related research as early as the 1980s. In 2000, Moren et al. proposed a system combining emotion and behavior selection based on Mowrer's two-process learning theory. In 2007, Wang et al. put forward an intelligent model based on artificial emotion, built on the human brain's emotion circuit, and applied it to an inverted-pendulum system, which successfully learned balance-control skills. In 2010, Barto et al., taking reinforcement learning as the theoretical framework from an evolutionary perspective, adopted active learning driven by intrinsic motivation and greatly improved the learning efficiency of the agent. In 2013, Oudeyer et al., starting from an organism's self-aware exploration and combining it with the idea of intrinsic motivation, proposed a learning machine based on system state-transition error and realized active exploratory learning of unknown environments. In 2014, Shen Xianzhang et al. proposed research on and application of hierarchical reinforcement learning with a potential-action model, studied the potential-action model of obstacles, and applied it, combined with hierarchical reinforcement learning, to robot path planning in obstacle environments.
The biology literature concludes that the sensorimotor systems of humans and animals contain mechanisms whose motivation is linked to intrinsic targets, called intrinsic-motivation mechanisms. Such a mechanism is a learning mechanism that takes the sensorimotor system as its theoretical basis and is guided by the organism's curiosity and orientation. Aiming at the problem that traditional machine learning cannot learn continuously, and inspired by this research, the invention combines the sensorimotor system with an intrinsic-motivation mechanism, simulates the cognitive behavior of an organism, takes a learning automaton as the basic framework, and proposes a cognitive development algorithm based on the cerebellum-basal ganglia-cerebral cortex loop. Using this algorithm, a two-wheeled robot gradually masters motion-balance control skills through interaction with an unknown environment and achieves real-time cognitive development, thus fulfilling the goal of continuous machine learning.
Among related patents, the invention patent with application publication No. CN106403948A discloses a three-dimensional flight-path planning method for a power-transmission-line inspection unmanned aerial vehicle.
The invention patent with application publication No. CN106557844A discloses a welding-robot path planning method based on a clustering-guided multi-objective particle-swarm optimization technique, which establishes a D-H parameter model of the welding robot, obtains an obstacle-avoidance path through a geometric obstacle-avoidance strategy, and performs Cartesian-space trajectory planning on that path.
However, neither of the above patents explores in practice how to endow robots with the neural activity and self-regulation mechanisms of humans and animals.
The invention content is as follows:
Aiming at the learning of continuous behaviors by a two-wheeled robot, the invention provides a robot control method that simulates the human psychological cognitive mechanism and brain neural activity. It enables the robot to perform bionic cognitive development by imitating the thinking mode of a cognitive development algorithm (CBCLA) based on the human cerebellum-basal ganglia-cerebral cortex loop, applies this method to mobile-robot path planning research, and thus provides a cognitive development method for optimizing the robot's pose path target trajectory, which specifically comprises the following:
A cognitive development method for optimizing a robot pose path target trajectory combines with the robot the thinking of a cognitive development algorithm (CBCLA) based on the human cerebellum-basal ganglia-cerebral cortex loop. The cognitive development process of the robot is divided into eight parts: a finite internal state set, the output set of the system, a set of internal operating behaviors, a state transition equation, the internal state of the system at time t, an evaluation function, the action-selection probability output of the striatal matrix, and the dopaminergic signal. Their mutual relation can be represented by an eight-element tuple:
CBCLA = {SC, MC, Cb_A, f, r(t), BG_strio, BG_matrix, SN_DPA}
1) SC = [s_1, s_2, ..., s_j] is the finite internal state set, corresponding to the sensory cortex in the cerebral cortex; s_j denotes the jth state and j is the number of internal states;
2) MC = [y_1, y_2, ..., y_i] is the output set of the system, corresponding to the motor cortex in the cerebral cortex; y_i denotes the ith output and i is the number of outputs;
3) Cb_A = [a_1, a_2, ..., a_k] is the set of internal operating behaviors, corresponding to the cerebellar region; a_k is the kth internal action and k is the number of internal actions;
4) f: s(t) × a(t) → s(t+1) is the state transition equation; that is, the state s(t+1) at time t+1 is determined jointly by the state s(t) and the operating behavior a(t) at time t, and is generally determined by the environment or a model;
5) r(t) = r(s(t), a(t)) denotes the reward signal received after the system, in internal state s(t) at time t, takes internal operating action a(t) and transitions to s(t+1); it corresponds to the thalamic signal emitted by the thalamus;
6) BG_strio is the evaluation function; the striosome serves mainly as an evaluation mechanism that predicts the orientation of the organism's movement, and further as an evaluation mechanism for the orientation of the intrinsic-motivation mechanism;
7) BG_matrix is the action-selection probability output of the striatal matrix; the matrix compartment of the striatum mainly performs the action-selection function in the learning process of the basal ganglia;
8) SN_DPA is the dopaminergic signal; as a guiding incentive for behavior evaluation, it strengthens the behavioral representation of the unknown maximal reward formed by the incentive, leading to accurate execution of the action.
From the state transition equation f: s(t) × a(t) → s(t+1), the external state s(t+1) ∈ S at time t+1 is always determined by the external state s(t) ∈ S and the external agent action a(t) ∈ A at time t, and is independent of external states and agent actions before time t.
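As an illustrative sketch only (the patent gives no implementation), the eight-element tuple and the Markov transition above could be carried by a small container like the following; all concrete types, and the `step` helper, are assumptions, not part of the patent:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class CBCLA:
    """Container mirroring CBCLA = {SC, MC, Cb_A, f, r(t), BG_strio, BG_matrix, SN_DPA}."""
    SC: List[str]                 # finite internal state set (sensory cortex)
    MC: List[str]                 # output set of the system (motor cortex)
    Cb_A: List[str]               # internal operating-behavior set (cerebellum)
    f: Callable[[str, str], str]  # state transition equation f: s(t) x a(t) -> s(t+1)
    r: Callable[[str, str], float]  # reward signal r(s(t), a(t))
    BG_strio: Dict[Tuple[str, str], float] = field(default_factory=dict)  # evaluation function
    BG_matrix: Dict[str, float] = field(default_factory=dict)  # action-selection probabilities
    SN_DPA: float = 0.0           # dopaminergic adjustment signal

    def step(self, s: str, a: str) -> Tuple[str, float]:
        """Apply one Markov transition: s(t+1) depends only on s(t) and a(t)."""
        return self.f(s, a), self.r(s, a)
```

A toy instance with a hard-wired transition, e.g. `CBCLA(SC=["s1", "s2"], MC=["y1"], Cb_A=["a1"], f=lambda s, a: "s2", r=lambda s, a: 1.0)`, makes the Markov property concrete: `step` consults only the current state and action.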
The input signal of the cerebral cortex described above contains two parts, sensory-cortex information and motor-cortex information, which together serve as the input to the striatum; therefore:
CC={SC,MC} (1)
The invention defines the evaluation function BG_strio as the discounted sum of future reward signals:
BG_strio(t) = r(t+1) + γ·r(t+2) + γ²·r(t+3) + ... (2)
where γ ∈ [0,1] is the discount factor. Owing to the intrinsic-motivation mechanism, the evaluation function BG_strio of the system gradually approaches 0, which ensures that the system finally reaches a stable state. η is defined as the orientation core of the intrinsic-motivation mechanism; its main function is to guide the direction of autonomous cognition. The value range of the orientation core η is generally defined as [η_min, η_max], i.e., between the best and worst orientation function values. The motion orientation function within the striosome is then defined as shown in equation (3), where λ is a parameter of the orientation function. The difference between the orientation function at two adjacent moments is defined as θ(t) = η(t) − η(t−1) and judges the orientation degree of the system: if θ(t) > 0, the orientation value at time t is larger than that at time t−1; conversely, if θ(t) < 0, the orientation value at time t is smaller than that at time t−1.
The invention adopts the Boltzmann probability rule to realize both the action-selection function of the matrix and the probability-selection mechanism of the learning automaton. Using the definition in equation (4), the action-selection probability output of the striatal matrix can be expressed as equation (5):
BG_matrix(a_j) = exp(BG_strio(SC(t), a_j)/T) / Σ_k exp(BG_strio(SC(t), a_k)/T) (5)
where T is a temperature constant that expresses the randomness of action selection: the larger T is, the more random the selection; conversely, the smaller T is, the more deterministic the selection. As T gradually approaches zero, the selection probability of the action with the largest BG_strio(SC(t), a_j) approaches 1. The value of T in the system decreases gradually with time.
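A minimal sketch of the Boltzmann (softmax) selection rule described above, assuming the striosome values BG_strio(SC(t), a_j) are given as a plain list; the function names and the max-subtraction stability trick are illustrative choices, not from the patent:

```python
import math
import random

def matrix_action_probabilities(values, T):
    """Boltzmann probabilities over BG_strio(SC(t), a_j) values (equation (5)).

    Large T -> near-uniform (random) selection; small T -> near-greedy selection.
    """
    m = max(values)  # subtract the max for numerical stability; result unchanged
    exps = [math.exp((v - m) / T) for v in values]
    z = sum(exps)
    return [e / z for e in exps]

def select_action(values, T, rng=random.random):
    """Sample an action index according to the Boltzmann probabilities."""
    probs = matrix_action_probabilities(values, T)
    u, acc = rng(), 0.0
    for j, p in enumerate(probs):
        acc += p
        if u <= acc:
            return j
    return len(probs) - 1
```

With `values = [1.0, 0.0]`, a small temperature such as `T = 0.01` makes the first action nearly certain, while a very large `T` makes the two probabilities nearly equal, matching the role of T described in the text.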
The evaluation function determined by the striosome at time t+1 is:
BG_strio(t+1) = r(t+2) + γ·r(t+3) + γ²·r(t+4) + ... (6)
Combining equation (2) and equation (6) yields equation (7):
BG_strio(t) = r(t+1) + γ·BG_strio(t+1) (7)
This shows that the evaluation function BG_strio(t) at time t can be predicted from the evaluation function BG_strio(t+1) at time t+1. However, because of the error present in the initial stage of prediction, the value of BG_strio(t) represented through the evaluation value BG_strio(t+1) is not equal to the actual value. The reward messages output by the thalamus and the striosome therefore need to be processed in the substantia nigra pars compacta, which releases the dopaminergic signal SN_DPA to adjust the evaluation value; this can be expressed by equation (8):
SN_DPA = r(t+1) + γ·BG_strio(t+1) − BG_strio(t) (8)
The invention simulates the neural activity of the biological sensorimotor system, takes a learning automaton as its framework, combines the characteristic that an intrinsic-motivation mechanism drives an organism to learn autonomously, and provides a cognitive development algorithm for optimizing the target trajectory of the robot's pose path.
Description of the drawings:
FIG. 1 is a diagram of an algorithmic control structure according to the present invention;
FIG. 2 is a cognitive development robot frame;
FIG. 3 is a graph of response output for each state;
FIG. 4 is a graph of evaluation function versus error simulation;
FIG. 5 shows simulation results of an anti-interference experiment;
FIG. 6 is a comparison graph of the merit functions of the CBCLA algorithm and the classical LA algorithm;
FIG. 7 is an error contrast diagram of the CBCLA algorithm and the classical LA algorithm.
Detailed description of the preferred embodiments
Aiming at the learning of continuous behaviors by a two-wheeled robot, the invention provides a robot-based cognitive development method that simulates the human psychological cognitive mechanism and brain neural activity, built on the thinking activity of a cognitive development algorithm (CBCLA) based on the human cerebellum-basal ganglia-cerebral cortex loop, and applies it to mobile-robot path planning research. The robot gradually masters motion-balance control skills through autonomous learning and development in an unknown environment, and achieves real-time tracking of the target.
According to this idea, a cognitive development method for optimizing a robot pose path target trajectory is created. The method combines with the robot the thinking of a cognitive development algorithm (CBCLA) based on the human cerebellum-basal ganglia-cerebral cortex loop. The cognitive development process of the robot is divided into eight parts: a finite internal state set, the output set of the system, a set of internal operating behaviors, a state transition equation, the internal state of the system at time t, an evaluation function, the action-selection probability output of the striatal matrix, and the dopaminergic signal. Their mutual relation can be represented by an eight-element tuple:
CBCLA = {SC, MC, Cb_A, f, r(t), BG_strio, BG_matrix, SN_DPA}
the specific meanings of the respective parts are as follows:
(1) SC = [s_1, s_2, ..., s_j] is the finite internal state set, corresponding to the sensory cortex in the cerebral cortex; s_j denotes the jth state and j is the number of internal states.
(2) MC = [y_1, y_2, ..., y_i] is the output set of the system, corresponding to the motor cortex in the cerebral cortex; y_i denotes the ith output and i is the number of outputs.
(3) Cb_A = [a_1, a_2, ..., a_k] is the set of internal operating behaviors, corresponding to the cerebellar region; a_k is the kth internal action and k is the number of internal actions.
(4) f: s(t) × a(t) → s(t+1) is the state transition equation; that is, the state s(t+1) at time t+1 is determined jointly by the state s(t) and the operating behavior a(t) at time t, and is generally determined by the environment or a model.
(5) r(t) = r(s(t), a(t)) denotes the reward signal received after the system, in internal state s(t) at time t, takes internal operating action a(t) and transitions to s(t+1); it corresponds to the thalamic signal emitted by the thalamus.
The input signal of the cerebral cortex contains two parts, sensory-cortex information and motor-cortex information, which together serve as the input to the striatum; therefore:
CC={SC,MC} (1)
(6) BG_strio is the evaluation function; the striosome serves mainly as an evaluation mechanism that predicts the orientation of the organism's actions, and further as an evaluation mechanism for the orientation of the intrinsic-motivation mechanism. The evaluation function is defined as the discounted sum of future reward signals:
BG_strio(t) = r(t+1) + γ·r(t+2) + γ²·r(t+3) + ... (2)
where γ ∈ [0,1] is the discount factor. Owing to the intrinsic-motivation mechanism, the evaluation function BG_strio of the system gradually approaches 0, ensuring that the system finally reaches a stable state. We define η as the orientation core of the intrinsic-motivation mechanism; its main function is to guide the direction of autonomous cognition. The value range of the orientation core η is generally defined as [η_min, η_max], i.e., between the best and worst orientation function values. The motion orientation function within the striosome is then defined as shown in equation (3).
The difference between the orientation function at two adjacent moments is defined as θ(t) = η(t) − η(t−1), which determines the orientation degree of the system: if θ(t) > 0, the orientation value at time t is larger than that at time t−1; conversely, if θ(t) < 0, the orientation value at time t is smaller than that at time t−1.
(7) BG_matrix is the action-selection probability output of the striatal matrix; the matrix compartment of the striatum mainly performs the action-selection function in the learning process of the basal ganglia. One of the most important features of learning driven by an intrinsic-motivation mechanism is that the action to perform is chosen according to the magnitude of a probability. The Boltzmann probability rule is adopted to realize the action-selection function of the matrix, and thereby the probability-selection mechanism of the learning automaton. Using the definition in equation (4), the action-selection probability output of the striatal matrix is expressed as equation (5):
BG_matrix(a_j) = exp(BG_strio(SC(t), a_j)/T) / Σ_k exp(BG_strio(SC(t), a_k)/T) (5)
where T is a temperature constant expressing the randomness of action selection: the larger T is, the more random the selection; the smaller T is, the more deterministic the selection. As T gradually approaches zero, the selection probability of the action with the largest BG_strio(SC(t), a_j) approaches 1. The value of T in the system decreases gradually with time, meaning that the system accumulates ever more knowledge during learning and gradually evolves from an unstable system into a stable one.
(8) SN_DPA is the dopaminergic signal, which serves as a guiding incentive for behavior evaluation; it strengthens the behavioral representation of the unknown maximal reward formed by the incentive, leading to more accurate execution of the action. The evaluation function determined by the striosome at time t+1 is:
BG_strio(t+1) = r(t+2) + γ·r(t+3) + γ²·r(t+4) + ... (6)
Combining equation (2) and equation (6) yields equation (7):
BG_strio(t) = r(t+1) + γ·BG_strio(t+1) (7)
This shows that the evaluation function BG_strio(t) at time t can be predicted from the evaluation function BG_strio(t+1) at time t+1. However, because of the error present in the initial stage of prediction, the value of BG_strio(t) represented through the evaluation value BG_strio(t+1) is not equal to the actual value. The reward messages output by the thalamus and the striosome therefore need to be processed in the substantia nigra pars compacta, which releases the dopaminergic signal SN_DPA to adjust the evaluation value, expressed by equation (8):
SN_DPA = r(t+1) + γ·BG_strio(t+1) − BG_strio(t) (8)
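Equation (8) is, in reinforcement-learning terms, a temporal-difference error. The following sketch computes it directly; the learning rate used to fold SN_DPA back into the evaluation value is an assumption the patent does not state:

```python
def dopamine_signal(r_next, bg_strio_t, bg_strio_next, gamma=0.9):
    """Equation (8): SN_DPA = r(t+1) + gamma*BG_strio(t+1) - BG_strio(t).

    gamma = 0.9 matches the discount factor used in the experiments;
    everything else is an illustrative sketch.
    """
    return r_next + gamma * bg_strio_next - bg_strio_t

def adjust_evaluation(bg_strio_t, sn_dpa, alpha=0.1):
    """Move the striosome evaluation toward its target by the dopaminergic
    error. The learning rate alpha is an assumption, not given in the patent."""
    return bg_strio_t + alpha * sn_dpa
```

When the prediction is exact, i.e. BG_strio(t) = r(t+1) + γ·BG_strio(t+1) as in equation (7), the signal is zero and no adjustment occurs; any mismatch produces a proportional correction.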
Proof of convergence of the algorithm of the invention:
For convenience of the proof, the evaluation function BG_strio(t) output by the striosome is written as J(t), as in equation (9):
BG_strio(t) = J(t) (9)
Apply the iterative algorithm in a Markov environment. If, for any state-action pair (s, a), the absolute value |r(s, a)| of the instantaneous reward and the iterative initial value J_0(s, a) are bounded, 0 ≤ γ < 1 (n being the number of iterations), and every state-action pair (s, a) is adjusted an unlimited number of times as n approaches infinity, then J_n(s, a) eventually tends to the optimal value J*(s, a) with probability 1.
Proof: consider the absolute value of the difference between the evaluation function of any state-action pair (s, a) and its optimum:
|J_n(s, a) − J*(s, a)| = |γ·max_{a'} J_{n−1}(s', a') − γ·max_{a'} J*(s', a')| ≤ γ·max_{s'', a'} |J_{n−1}(s'', a') − J*(s'', a')| (10)
where s' and a' are the state and action after the transition and s'' is an arbitrary state of the secondary transition. Let the maximum estimation error of the evaluation function at the nth iteration be:
ΔJ_n = max_{s,a} |J_n(s, a) − J*(s, a)| (11)
Then:
ΔJ_n ≤ γ·ΔJ_{n−1} ≤ γ^n·ΔJ_0 (12)
Because |r(s, a)| and the initial values are bounded, ΔJ_0 is bounded, so for every (s, a) the error ΔJ_n approaches 0 as n approaches infinity. Hence the evaluation function of the cognitive development algorithm based on the cerebellum-basal ganglia-cerebral cortex loop converges as n → ∞, at which point the system is in an equilibrium steady state.
The method combines dynamic programming with knowledge of animal physiology, thereby realizing reward-driven online machine learning. The cognitive development algorithm is applied to mobile-robot path planning research: the robot gradually masters motion-balance control skills through autonomous learning and development in an unknown environment, and achieves real-time tracking of the target.
The invention is further illustrated with reference to the following figures and embodiments.
Fig. 1 shows the control structure of the algorithm according to the invention; control proceeds in the order shown in the figure. Fig. 2 shows the cognitive development robot framework, which corresponds to the state quantities shown in Fig. 1. The balance of the robot is controlled first, because a robot capable of self-balancing is the premise of the experiment.
To verify the effectiveness, robustness, and superiority of the proposed cognitive development algorithm based on the cerebellum-basal ganglia-cerebral cortex loop (CBCLA), a two-wheeled robot was used as the experimental subject to study how the robot finally learns motor skills through autonomous learning in an unknown environment.
The robot has four output quantities that must satisfy corresponding conditions during the experiment: the angular velocities θ_r and θ_l of the right and left wheels are both less than 3.489 rad/s, the inclination angle α of the body is less than 0.1744 rad, and the angular velocity β of the robot's pendulum is less than 3.489 rad/s. The discount factor is γ = 0.9 and the sampling time is 0.01 s. The criterion for the robot achieving self-balance is maintaining balance for 20000 steps in one trial. Learning is considered failed if the number of trials exceeds 1000 without any trial reaching 20000 balance steps. After each failed trial, the initial state and each weight are re-assigned random values within a certain range, and the next round of learning begins.
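The trial bookkeeping described above (success at 20000 balance steps, failure after 1000 trials, re-randomized restarts) might be organized as in this sketch; `policy`, `simulate_step`, and `reset` are placeholder hooks for the learner and the robot model, not part of the patent:

```python
def run_trials(policy, simulate_step, reset, max_trials=1000, success_steps=20000):
    """Run balance trials until one trial holds balance for `success_steps`.

    policy(state) -> action; simulate_step(state, action) -> (state, fell);
    reset() -> fresh randomized initial state. All three are assumed hooks.
    """
    for trial in range(1, max_trials + 1):
        state = reset()  # re-randomize initial state (and weights) after a failure
        for _step in range(success_steps):
            state, fell = simulate_step(state, policy(state))
            if fell:  # a state quantity left its admissible range
                break
        else:
            return trial, success_steps  # balanced for the full trial: success
    return None  # learning failed within max_trials trials
```

With the patent's 0.01 s sampling time, the reported 220 steps to learn balance correspond to about 2.2 s of simulated time.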
(1) Balance-control experiment: in an unknown, interference-free environment, the robot adopts the CBCLA algorithm proposed herein. Through continuous learning, after 42 exploratory trials it completed the experiment on the 43rd trial, learning the balance-control skill in about 220 steps, i.e., about 2.2 s, demonstrating the algorithm's fast autonomous-learning capability and effectiveness. Fig. 3 shows the response curves of each state quantity for the first 3000 steps of the experiment, and Fig. 4 shows the evaluation-function and error simulation curves for the first 3000 steps.
(2) Anti-interference experiment: during actual operation, the input and output signals of the system are more or less disturbed by external noise, or inaccuracy of the detection device causes a certain error in the state quantities. To simulate the actual environment, once the robot had learned balance control and maintained it for 9800 steps, a pulse signal with an amplitude of 25 was added to each input state quantity. If the robot can withstand the disturbance of the pulse signal and keep its balance, the experiment is considered successful and the CBCLA algorithm proposed herein is shown to have a certain robustness. Fig. 5 shows the output response of each state after the pulse signal is added; it can be seen that after about 200 steps (i.e., 2 s) the robot reaches the equilibrium position again.
(3) Algorithm comparison experiment: the algorithm introduces an intrinsic-motivation mechanism to drive the robot's autonomous learning, which reduces the system error and improves the algorithm's convergence rate. To demonstrate the superiority of the CBCLA algorithm, a balance-control experiment was carried out on the two-wheeled robot with both the classic learning automaton (LA) algorithm and the CBCLA algorithm, and the results were analyzed. The parameter settings of the two algorithms were identical. The evaluation function BG_strio determines whether the corresponding system can reach stability. Comparing the simulation curves of the LA and CBCLA algorithms in Fig. 6, the evaluation functions show that the CBCLA algorithm completes the learning of the balance-control skill in about 220 steps (i.e., 2.2 s), whereas the classic LA algorithm needs about 600 steps (i.e., 6 s), proving that the convergence speed of the CBCLA algorithm is superior to that of the classic LA algorithm. The error SN_DPA reflects the stability of the system; the error comparison of the two algorithms in Fig. 7 shows that the error amplitude of the CBCLA algorithm is smaller than that of the classic LA algorithm, which benefits the stability of the system.

Claims (6)

1. A cognitive development method for optimizing a robot posture path target track combines a cognitive development algorithm (CBCLA) thinking of a robot based on cerebellum-basal nucleus-cerebral cortex loop with the robot, the cognitive development process of the robot is divided into eight parts which are respectively a limited internal state set, a system output set and an internal operation behavior set, a state transition equation, a system internal state at the time t, an evaluation function, an action selection probability output of a striatum matrix and dopaminergic energy, and the mutual correlation of the eight parts can be represented by an eight-element array:
CBCLA={SC,MC,Cb A ,f,r(t),BG strio ,BG matrix ,SN DPA }
1) SC = [s_1, s_2, ..., s_j] is the finite internal state set, corresponding to the sensory cortex in the cerebral cortex; s_j denotes the j-th state and j is the number of internal states;
2) MC = [y_1, y_2, ..., y_i] is the output set of the system, corresponding to the motor cortex in the cerebral cortex; y_i denotes the i-th output and i is the number of outputs;
3) Cb_A = [a_1, a_2, ..., a_k] is the set of internal operation behaviors, corresponding to the cerebellar region; a_k denotes the k-th internal action and k is the number of internal actions;
4) s(t) × a(t) → s(t+1) is the state transition equation, i.e. the state s(t+1) at time t+1 is jointly determined by the state s(t) at time t and the operation behavior a(t), and is generally determined by the environment or a model;
5) r(t) = r(s(t), a(t)) denotes the reward signal received after the system takes the internal operation action a(t) in internal state s(t) at time t and transitions to s(t+1), corresponding to the sensory signal relayed by the thalamus;
6) BG_strio is the evaluation function; the striosome is mainly an evaluation mechanism for predicting the movement orientation of the organism, and here serves as the orientation evaluation mechanism of the intrinsic motivation mechanism;
7) BG_matrix is the action selection probability output of the striatal matrix; the matrix compartment of the striatum mainly performs the action selection function in the learning process of the basal ganglia;
8) SN_DPA is the dopaminergic signal, which serves as a guiding incentive for behavior evaluation; by reinforcing the behaviors associated with the unknown maximum reward, it leads to accurate execution of actions.
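For illustration only, the eight-element tuple of claim 1 can be sketched as a data structure. The field names mirror the symbols in the claim; the dataclass layout, the toy dynamics, and the dictionary representations of BG_strio and BG_matrix are assumptions, not part of the patented method.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class CBCLA:
    SC: List[int]                    # finite internal state set (sensory cortex)
    MC: List[float]                  # system output set (motor cortex)
    Cb_A: List[int]                  # internal operation behavior set (cerebellum)
    f: Callable[[int, int], int]     # state transition: s(t+1) = f(s(t), a(t))
    r: Callable[[int, int], float]   # reward signal r(s(t), a(t)) (thalamic relay)
    BG_strio: Dict[Tuple[int, int], float] = field(default_factory=dict)  # evaluation function
    BG_matrix: Dict[int, List[float]] = field(default_factory=dict)       # action probabilities
    SN_DPA: float = 0.0              # dopaminergic prediction-error signal

# Toy instantiation: two states, two actions, trivial parity dynamics.
model = CBCLA(
    SC=[0, 1],
    MC=[0.0, 1.0],
    Cb_A=[0, 1],
    f=lambda s, a: (s + a) % 2,
    r=lambda s, a: 1.0 if (s + a) % 2 == 0 else 0.0,
)
```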
2. The cognitive development method of robot pose-path target trajectory optimization of claim 1, wherein: in s(t) × a(t) → s(t+1), according to the state transition equation f, the state s(t+1) ∈ SC at time t+1 is always determined by the state s(t) ∈ SC and the agent action a(t) ∈ Cb_A at time t, and is independent of the states and agent actions before time t.
3. The cognitive development method of robot pose-path target trajectory optimization of claim 1, wherein: the input signal from the cerebral cortex contains two parts, sensory cortex information and motor cortex information respectively, which together serve as the input to the striatum, and thus:
CC={SC,MC} (1)
4. The cognitive development method of robot pose-path target trajectory optimization of claim 1, wherein: the evaluation function BG_strio is defined as the discounted sum of future reward signals:
BG_strio(t) = Σ_{k=0}^{∞} γ^k·r(t+k+1)   (2)
wherein γ ∈ [0,1] is a discount factor; owing to the orientation property of the intrinsic motivation mechanism, the evaluation function BG_strio of the system gradually approaches 0, thereby ensuring that the system finally reaches a stable state. η is defined as the orientation kernel of the intrinsic motivation mechanism, whose main function is to guide the direction of autonomous cognition; the value range of the orientation kernel η is generally defined as [η_min, η_max], i.e. between the worst and the best orientation function values, and the motor orientation function in the striosome is defined as shown in equation (3).
The orientation degree of the system is determined by defining the difference of the orientation function at two adjacent moments as θ(t) = η(t) − η(t−1): if θ(t) > 0, the orientation value at time t is larger than that at time t−1; if θ(t) < 0, the orientation value at time t is smaller than that at time t−1; if θ(t) = 0, the orientation value is unchanged.
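As a minimal sketch of the orientation test above (the function name and return labels are illustrative assumptions, not terms from the claims):

```python
def orientation_trend(eta_t: float, eta_prev: float) -> str:
    """Classify theta(t) = eta(t) - eta(t-1) as in claim 4."""
    theta = eta_t - eta_prev
    if theta > 0:
        return "improved"    # orientation value at t larger than at t-1
    if theta < 0:
        return "worsened"    # orientation value at t smaller than at t-1
    return "unchanged"       # orientation value unchanged
```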
5. The cognitive development method of robot pose-path target trajectory optimization of claim 1, wherein: the Boltzmann probability rule is adopted to realize the action selection function of the matrix and the probability selection mechanism of the learning automaton, with the state-action evaluation BG_strio(SC(t), a_j) first defined as in equation (4); according to that definition, the action selection probability output of the striatal matrix can be expressed by equation (5):
BG_matrix(SC(t), a_j) = exp(BG_strio(SC(t), a_j)/T) / Σ_k exp(BG_strio(SC(t), a_k)/T)   (5)
wherein T is a temperature constant representing the randomness of action selection: the larger T is, the more random the action selection; conversely, the smaller T is, the more deterministic the action selection. As T gradually approaches zero, the selection probability of the action with the largest BG_strio(SC(t), a_j) gradually approaches 1; in the system, the value of T gradually decreases over time.
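The Boltzmann selection rule of claim 5 can be sketched as follows; this is an illustrative implementation under the standard softmax formulation, and the max-subtraction for numerical stability is an implementation choice not stated in the claims.

```python
import math
import random

def boltzmann_probabilities(values, T):
    """Action probabilities under the Boltzmann rule of claim 5.

    values: BG_strio(SC(t), a_j) for each candidate action a_j
    T: temperature; large T -> near-uniform (random) choice,
       T -> 0 -> the max-valued action's probability -> 1.
    """
    m = max(values)  # subtract the max so exp() cannot overflow
    exps = [math.exp((v - m) / T) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def select_action(values, T, rng=random):
    """Sample an action index according to the Boltzmann probabilities."""
    probs = boltzmann_probabilities(values, T)
    return rng.choices(range(len(values)), weights=probs, k=1)[0]
```

Annealing T toward zero over time, as the claim describes, moves the policy from exploratory (near-uniform) to exploitative (near-greedy).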
6. The cognitive development method of robot pose-path target trajectory optimization of claim 4, wherein: the evaluation function determined by the striosome at time t+1 is:
BG_strio(t+1) = Σ_{k=0}^{∞} γ^k·r(t+k+2)   (6)
Combining equation (2) and equation (6) yields equation (7):
BG_strio(t) = r(t+1) + γ·BG_strio(t+1)   (7)
This shows that the evaluation function BG_strio(t) at time t can be represented by the evaluation function BG_strio(t+1) at time t+1. However, because of the prediction error present in the initial stage, the value represented by r(t+1) + γ·BG_strio(t+1) is not equal to the actual value of BG_strio(t); therefore, the reward message output by the thalamus and the evaluation output by the striosome are processed in the substantia nigra pars compacta, which releases the dopaminergic signal SN_DPA used to adjust the evaluation value, expressed by equation (8):
SN_DPA = r(t+1) + γ·BG_strio(t+1) − BG_strio(t)   (8).
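Equations (7) and (8) amount to a temporal-difference correction, which can be sketched as follows. Only equation (8) appears in the claim; the update function and its learning rate alpha are illustrative assumptions about how the dopaminergic signal might adjust the evaluation value.

```python
def dopamine_signal(r_next, bg_next, bg_now, gamma):
    """Equation (8): SN_DPA = r(t+1) + gamma * BG_strio(t+1) - BG_strio(t)."""
    return r_next + gamma * bg_next - bg_now

def update_evaluation(bg_now, sn_dpa, alpha):
    # Hypothetical correction step: move the striosome evaluation toward the
    # dopamine-adjusted target; alpha is a learning rate not given in the claims.
    return bg_now + alpha * sn_dpa
```

When the prediction is exact, equation (7) makes SN_DPA zero and the evaluation value is left unchanged.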
CN201711117326.2A 2017-11-13 2017-11-13 The cognitive development method of robot pose path targetpath optimization Withdrawn CN107894715A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711117326.2A CN107894715A (en) 2017-11-13 2017-11-13 The cognitive development method of robot pose path targetpath optimization


Publications (1)

Publication Number Publication Date
CN107894715A true CN107894715A (en) 2018-04-10

Family

ID=61805123

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711117326.2A Withdrawn CN107894715A (en) 2017-11-13 2017-11-13 The cognitive development method of robot pose path targetpath optimization

Country Status (1)

Country Link
CN (1) CN107894715A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109059931A (en) * 2018-09-05 2018-12-21 北京航空航天大学 A kind of paths planning method based on multiple agent intensified learning
CN109212975A (en) * 2018-11-13 2019-01-15 北方工业大学 A kind of perception action cognitive learning method with developmental mechanism
CN109696918A (en) * 2018-11-16 2019-04-30 华北理工大学 A kind of aircraft of tracking four-axis system implementation method and application this method based on color lump identification
CN114761183A (en) * 2019-12-03 2022-07-15 西门子股份公司 Computerized engineering tool and method for developing neurological skills for robotic systems

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105205533A (en) * 2015-09-29 2015-12-30 华北理工大学 Development automatic machine with brain cognition mechanism and learning method of development automatic machine


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ren Hongge et al.: "Establishment of a Cognitive Model of the Sensorimotor System Based on an Operant Conditioning Mechanism", Robot (《机器人》) *


Similar Documents

Publication Publication Date Title
CN107894715A (en) The cognitive development method of robot pose path targetpath optimization
Abreu et al. Learning low level skills from scratch for humanoid robot soccer using deep reinforcement learning
CN105205533B (en) Development automatic machine and its learning method with brain Mechanism of Cognition
CN105700526A (en) On-line sequence limit learning machine method possessing autonomous learning capability
Cox et al. Neuromodulation as a robot controller
Cangelosi et al. Cognitive robotics
Tian et al. Learning to drive like human beings: A method based on deep reinforcement learning
Azimirad et al. Experimental study of reinforcement learning in mobile robots through spiking architecture of thalamo-cortico-thalamic circuitry of mammalian brain
CN109227550A (en) A kind of Mechanical arm control method based on RBF neural
Tutuko et al. Route optimization of non-holonomic leader-follower control using dynamic particle swarm optimization
Yan et al. Path Planning for Mobile Robot's Continuous Action Space Based on Deep Reinforcement Learning
Jiang et al. A Neural Network Controller for Trajectory Control of Industrial Robot Manipulators.
Perez-Pena et al. An approach to motor control for spike-based neuromorphic robotics
Singh et al. Neuron-based control mechanisms for a robotic arm and hand
Sarim et al. An artificial brain mechanism to develop a learning paradigm for robot navigation
Chen et al. A crash avoidance system based upon the cockroach escape response circuit
Azimirad et al. Optimizing the parameters of spiking neural networks for mobile robot implementation
Suro et al. A hierarchical representation of behaviour supporting open ended development and progressive learning for artificial agents
Saxena et al. Advancement of industrial automation in integration with robotics
Wang et al. A computational developmental model of perceptual learning for mobile robot
CN113967909A (en) Mechanical arm intelligent control method based on direction reward
Arie et al. Reinforcement learning of a continuous motor sequence with hidden states
Chai et al. A Possible Explanation for the Generation of Habit in Navigation: a Striatal Behavioral Learning Model
Wang et al. A biologically inspired behavior control for the unexpected uncertainty with motivated developmental network
Garza-Coello et al. AWS DeepRacer: A Way to Understand and Apply the Reinforcement Learning Methods

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20180410