CN106598058A - Intrinsically motivated extreme learning machine autonomous development system and operating method thereof - Google Patents
- Publication number
- CN106598058A (application CN201611182422.0A)
- Authority
- CN
- China
- Prior art keywords
- function
- action
- internal motivation
- learning
- state
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
- G05D1/08—Control of attitude, i.e. control of roll, pitch, or yaw
- G05D1/0891—Control of attitude, i.e. control of roll, pitch, or yaw specially adapted for land vehicles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
Abstract
The invention belongs to the technical field of intelligent robots, and specifically relates to an intrinsically motivated extreme learning machine autonomous development system and an operating method thereof. The autonomous development system comprises an internal state set, an action set, a state transition function, an intrinsic-motivation orientation function, a reward signal, a reinforcement-learning update formula, an evaluation function and an action-selection probability. An intrinsic-motivation signal is used to simulate the orienting cognitive mechanism behind human interest in things, so that a robot completes relevant tasks of its own accord; this addresses the robot's poor capacity for self-directed learning. Furthermore, an extreme learning machine network is used to carry out training and to store knowledge and experience, so that after a failed trial the robot can keep exploring with the stored knowledge instead of learning from scratch. This increases the robot's learning speed and addresses the low efficiency of single-step reinforcement learning.
Description
Technical field
The invention belongs to the technical field of intelligent robots, and in particular relates to an intrinsically motivated extreme learning machine autonomous development system and an operating method thereof.
Background technology
With the continuous development of intelligent technology in today's society, robotics plays an extremely important role in people's productive life: robots can replace humans in relatively heavy tasks, improve work efficiency to a certain extent, and save a great deal of human resources.
Intrinsic motivation is an extremely important concept in developmental psychology and a vital mechanism of open-ended human cognitive development. It drives an agent to explore and manipulate its environment, cultivating curiosity and engagement in interesting new activities. Such motivation is affected by many factors, including survival, curiosity and orientation; intrinsic motivation is therefore regarded as a key mechanism in psychological development, during both sensorimotor and cognitive development.
In 2006, Zhang Tao et al. combined the Q-learning algorithm with a BP neural network to realize model-free learning control of a non-discretized inverted pendulum, improving learning speed. In 2010, Ren Hongge et al. adopted a recurrent neural network learning algorithm based on Skinner's operant conditioning theory as the learning mechanism of a robot, completed the control of a two-wheeled self-balancing robot, and demonstrated the robustness of the algorithm. In 2013, Oudeyer et al., addressing the problem of autonomous mental exploration in biology and drawing on the idea of intrinsic motivation (IM), proposed a systematic state-transfer error learning machine and realized active exploratory learning of unknown environments by a robot based on an intrinsic-motivation model. In 2013, the incremental self-organizing network input/output scheme proposed by Shen et al. avoided problems such as the curse of dimensionality caused by the look-up-table representation of traditional Q-learning, strengthening the agent's ability to accumulate experience. In 2014, Hu Qixiang et al., inspired by intrinsic motivation in psychology, proposed an intrinsically motivated online autonomous learning method for mobile robots in unknown environments, improving algorithm convergence, reducing system error and significantly raising the degree of intelligence.
Chinese invention patent CN201110255530.7 discloses a neural-network-based initialization method for robot reinforcement learning. The method effectively improves learning efficiency in the initial stage and accelerates convergence; through Q-value initialization, prior knowledge can be incorporated into the learning system and the robot's early-stage learning optimized, providing the robot with a better learning foundation and overcoming shortcomings of existing research on robot reinforcement learning.
Chinese invention patent CN201510358313.9 discloses an intrinsically motivated autonomous cognitive system and control method for a motion-balancing robot. Aimed at the autonomous cognition problem of motion balancing, the patent establishes an intrinsically motivated autonomous cognitive system for a balancing robot, offering a method and approach for understanding human intelligent learning behavior in depth and for building more autonomous cognitive robots.
Chinese invention patent CN201510442275.5 discloses a scanned-certificate image recognition method based on an extreme learning machine. The patent provides a fast processing method with strong generalization ability for the similarity retrieval of certificates, significantly improving the classification accuracy of certificate image retrieval.
At present, the poor initiative in the motion-control balancing problem of two-wheeled self-balancing robots, and the low efficiency of conventional single-step reinforcement learning, remain unsolved; an intrinsically motivated extreme learning machine autonomous development system and its control method are urgently needed.
Summary of the invention
The purpose of the present invention is precisely to overcome the shortcomings of the existing motion-control balancing of two-wheeled self-balancing robots. An intrinsically motivated extreme learning machine autonomous development system is provided, which takes reinforcement Q-learning as its framework, uses the intrinsic-motivation signal as the internal reward driving the robot's learning, and uses an extreme learning machine network as the storage for accumulated knowledge. By imitating the learning model of the human brain, the robot can, like a person, learn and progressively improve its balance-control skill through self-learning and self-organization, thereby solving both the poor initiative in the balancing problem of two-wheeled self-balancing robots and the low efficiency of past single-step reinforcement learning. To solve the above technical problems, the present invention adopts the following technical scheme:
An intrinsically motivated extreme learning machine autonomous development system is provided, the system comprising an internal state set, an action set, a state transition function, an intrinsic-motivation orientation function, a reward signal, a reinforcement-learning update formula, an evaluation function and an action-selection probability. The cognitive model of the system takes reinforcement Q-learning combined with an extreme learning machine network as its framework, is driven by an intrinsic-motivation mechanism, and is designed as an eight-tuple model:

⟨S, A, T, O, R, Q, V, P⟩

The meaning of each element is as follows:

(1) S is the internal state set, S = {s1, s2, …, sn}, where si denotes the i-th state and n is the number of all states that can arise.
(2) A is the action set, A = {a1, a2, …, am}, where aj denotes the j-th action and m is the number of all actions.
(3) T is the state transition function: the external state s(t+1) at time t+1 is always determined jointly by the external state s(t) at time t and the agent action a(t), preferably by the system model together with the environment.
(4) O is the orientation function of the intrinsic motivation, determined by the system's evaluation function.
(5) R is the reward signal r(t): the reward obtained after the system, in state s(t) at time t, performs action a(t) and the system state transfers to s(t+1).
(6) Q is the reinforcement-learning update formula: the value function obtained after the system, in external state s(t) at time t, performs agent action a(t) and transfers to state s(t+1).
(7) V is the evaluation function.
(8) P is the action-selection probability P(a|s), the probability of selecting action a in state s.
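As a minimal illustration of the eight-tuple model (the names, types and the tabular Q representation are ours, not the patent's), the system can be sketched as a container that holds each element:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class DevelopmentModel:
    """Sketch of the eight-tuple <S, A, T, O, R, Q, V, P>."""
    states: List[int]                              # S: internal state set
    actions: List[int]                             # A: action set
    transition: Callable[[int, int], int]          # T: s(t+1) = T(s(t), a(t))
    orientation: Callable[[float, float], float]   # O: intrinsic-motivation orientation
    reward: Callable[[int, int, int], float]       # R: r(s, a, s')
    q_table: Dict[Tuple[int, int], float] = field(default_factory=dict)  # Q: value function
    value: Dict[int, float] = field(default_factory=dict)                # V: evaluation function
    # P, the action-selection probability, is derived from the Q-values
    # (e.g. by a softmax or epsilon-greedy rule), so it is not stored.
```

A concrete instance would plug in the robot's discretized states, its control actions, and the transition/reward functions of the environment.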
Compared with the prior art, the above technical scheme brings the following technical effects:
1) The present invention uses an intrinsic-motivation reward mechanism. Through the cognitive mechanism by which intrinsic motivation orients interest, the evaluation value is determined from the judgement of the task, driving the agent to complete the appointed task spontaneously. Compared with traditional cognitive-development methods, this effectively improves the agent's learning initiative.
2) The present invention uses an extreme learning machine network in place of the conventional self-organizing neural network to carry out training and to store knowledge and experience, so that after a failed trial the robot does not learn from scratch but keeps exploring with the stored knowledge. Compared with traditional neural-network storage, this greatly increases the speed at which the two-wheeled robot adapts to its environment, allowing it to learn self-balancing in an unknown environment within a short time.
3) The present invention combines intrinsic motivation with the classical (externally motivated) Q-learning algorithm; through the interaction of the internal evaluation value and the external adaptation value, the agent's learning initiative is enhanced, and the robot's learning efficiency is also effectively raised.
Preferred schemes of the present invention are as follows:
In the state transition function, the state transition equation determined by the state-transfer unit is:

s(t+1) = T(s(t), a(t))

that is, the external state s(t+1) at time t+1 is always determined by the external state s(t) and the agent action a(t) at time t, and is independent of the external states and agent actions before time t.
The orientation function of the intrinsic motivation depends on an orientation parameter: the smaller its value, the smaller the corresponding action reward and the weaker the intrinsic-motivation orientation in the system; conversely, the larger its value, the larger the corresponding action reward and the stronger the intrinsic-motivation orientation.
The intrinsic motivation describes, in psychological terms, degrees of novelty, curiosity and boredom; it is the driving force behind the exploration and learning of humans and other organisms. In the learning process this driving force is realized by an orientation mechanism function introduced through intrinsic motivation, which simulates the working mechanism of the human brain so that the agent possesses autonomous learning ability and learns more efficiently.
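The patent's exact orientation formula is not reproduced in this text; as a hedged illustration of the behavior it describes (orientation large when the chosen action reduces the error toward the target, small when the error grows), a sigmoid of the error reduction can serve as a stand-in:

```python
import math

def orientation(prev_error: float, curr_error: float, eta: float = 1.0) -> float:
    """Illustrative orientation signal, NOT the patent's formula.

    Returns a value in (0, 1): above 0.5 when the current error is smaller
    than the previous one (the action moved the system toward the target),
    below 0.5 when the error grew. `eta` plays the role of the orientation
    parameter: a larger eta gives a stronger orientation response for the
    same error reduction."""
    return 1.0 / (1.0 + math.exp(-eta * (prev_error - curr_error)))
```

Any monotone function of the error reduction would exhibit the qualitative behavior described above; the sigmoid is chosen only because it is bounded and smooth.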
In the classical Q-learning algorithm, the reward signal iterates the action-value function of the Markov decision process according to the temporal-difference (TD) algorithm; the iterative formula is:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

where α is the learning factor.
The reward signal is then replaced by the intrinsically driven reward signal:

r_t = λ1 r_i(t) + λ2 r_e(t)

where r_i is the intrinsic-motivation function, r_e is the external-motivation function, and λ1 and λ2 are their respective weights. The iterative formula becomes:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ λ1 r_i(t) + λ2 r_e(t) + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

The reinforcement Q-learning algorithm is modeled as a Markov decision process and iterates toward the optimal solution:

Q*(s, a) = E[ r + γ max_{a'} Q*(s', a') ]

where γ is the discount factor, α is the learning factor, and 0 < α, γ < 1.
In this algorithm the internal reward function is supplied by the intrinsic motivation. Suppose the agent runs in an external environment with actual output y(t) and desired output y_d(t); the difference e(t) = y_d(t) − y(t) is defined as the internal reward of the system. When the system selects action a(t) at time t, the state transfers from s(t) to s(t+1). If |e(t)| < |e(t−1)|, i.e. the error is smaller than at the previous moment, the action chosen at time t brings the system closer to the target state than the action chosen at time t−1 did, and the orientation at time t is large; conversely, if |e(t)| > |e(t−1)|, the orientation at time t is small.
Further, the flow of the reinforcement Q-learning algorithm is as follows:
Step 1: Randomly initialize the Q-values;
Step 2: Observe the current state s(t) and select an action decision a(t) to execute;
Step 3: Obtain the next state s(t+1) and, at the same time, the reward signal r(t);
Step 4: Update the Q-value according to the iterative formula above.
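The four steps above, with the intrinsically driven reward r = λ1·r_i + λ2·r_e in place of the plain external reward, can be sketched as a single update step (variable names and the tabular Q representation are ours, assuming a small discrete state/action space):

```python
def q_update(Q, s, a, s_next, r_internal, r_external,
             lam1=0.5, lam2=0.5, alpha=0.1, gamma=0.9, actions=(0, 1)):
    """One step of intrinsically driven reinforcement Q-learning.

    Combines the intrinsic and external rewards with weights lam1, lam2,
    then applies the standard TD update
        Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a)).
    Unvisited (state, action) pairs default to 0."""
    r = lam1 * r_internal + lam2 * r_external
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))
    return Q[(s, a)]
```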
Driven by the intrinsic-motivation mechanism, the internal action evaluation function V(t) gradually approaches 0, so that the two-wheeled robot can keep the most suitable static balance. The evaluation function is defined as the discounted return:

V(t) = Σ_{k≥0} γ^k r(t+k)

where γ is the discount factor. The evaluation function at time t can therefore be expressed through the evaluation function at time t+1:

V(t) = r(t) + γ V(t+1)

Treating V as an observer, the TD error formula is established as:

δ(t) = r(t) + γ V(t+1) − V(t)
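The discounted return and the TD error just defined can be computed directly (a small sketch; the reward sequence is illustrative):

```python
def evaluation(rewards, gamma=0.9):
    """Discounted evaluation function V(t) = sum_k gamma^k * r(t+k),
    computed backwards via the recursion V(t) = r(t) + gamma*V(t+1)."""
    v = 0.0
    for r in reversed(rewards):
        v = r + gamma * v
    return v

def td_error(r_t, v_t, v_next, gamma=0.9):
    """TD error delta(t) = r(t) + gamma*V(t+1) - V(t); driving it toward 0
    corresponds to the evaluation function settling near its target."""
    return r_t + gamma * v_next - v_t
```

Note that when V already satisfies the recursion exactly, the TD error is zero, which is the fixed point the training is driving toward.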
With reinforcement learning as the framework, the idea of driving the agent through intrinsic motivation is combined with the intrinsic-motivation orientation function as the method's reward mechanism, and an extreme learning machine network is used for training; the robot's autonomous learning ability is thereby strengthened, and its learning speed is also greatly increased.
The action-selection probability is a random probability.
The present invention also provides an operating method of the intrinsically motivated extreme learning machine autonomous development system, comprising the following steps:
Step 1: Initialize the current system state, choose the discount factor γ and learning factor α, and choose suitable weights λ1 and λ2 for the intrinsic-motivation and external-motivation functions.
Step 2: Calculate the Q-values of all actions that may be taken in reinforcement learning.
Step 3: Select a suitable action according to the Q-values.
Step 4: Execute the current action and make a decision for the next stage of learning.
Step 5: Calculate the intrinsic-motivation function while computing the optimal action decision according to reinforcement learning.
Step 6: Update the Q-value according to the iterative formula.
Step 7: Update the current time t and the current state s(t).
Step 8: Repeat Steps 2 to 7 until training is finished.
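Steps 1 to 8 above amount to an episodic training loop. A minimal sketch (the `env_step` callback, epsilon-greedy selection and all parameter values are our assumptions standing in for the two-wheeled robot dynamics):

```python
import random

def train(env_step, actions, episodes=50,
          alpha=0.1, gamma=0.9, lam1=0.5, lam2=0.5, epsilon=0.1):
    """Sketch of the operating method (Steps 1-8).

    env_step(s, a) -> (s_next, r_intrinsic, r_external, done) is a
    placeholder for the environment; here Q is a plain table, whereas the
    patent stores the learned values in an ELM network."""
    Q = {}                                                  # Step 1: initialize
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            if random.random() < epsilon:                   # Step 3: explore...
                a = random.choice(actions)
            else:                                           # ...or act greedily on Q (Step 2)
                a = max(actions, key=lambda a2: Q.get((s, a2), 0.0))
            s_next, r_int, r_ext, done = env_step(s, a)     # Step 4: execute
            r = lam1 * r_int + lam2 * r_ext                 # Step 5: combine motivations
            best = max(Q.get((s_next, a2), 0.0) for a2 in actions)
            Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (
                r + gamma * best - Q.get((s, a), 0.0))      # Step 6: update Q
            s = s_next                                      # Step 7: advance state
    return Q                                                # Step 8: loop until finished
```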
Description of the drawings
Fig. 1 is the system training flow chart of the present invention.
Fig. 2 is the structural diagram of the two-wheeled robot system.
Fig. 3 is the simplified structural diagram of the two-wheeled robot.
Fig. 4 is the network model of the extreme learning machine.
Fig. 5 is the structural framework of the autonomous development system based on intrinsic motivation.
Fig. 6 is the structural framework of the intrinsically motivated extreme learning machine autonomous development system.
Fig. 7 shows the state-quantity change curves.
Fig. 8 shows the evaluation function and error curves.
Fig. 9 shows the curves of the forces on the robot.
Fig. 10 is the evaluation-function simulation comparison.
Fig. 11 is the system-error simulation comparison.
Specific embodiment
The present invention is further elaborated through the embodiment given with reference to Figs. 1 to 11, but the embodiment does not constitute any restriction on the invention.
The structural framework of the intrinsically motivated extreme learning machine autonomous development system of the present invention is shown in Fig. 6, and training is carried out according to the flow shown in Fig. 1.
Fig. 2 gives the structural model of the two-wheeled robot system, which is in essence a simulated inverted-pendulum model. Fig. 3 gives the simplified structural model of the two-wheeled robot and its parameters; the meanings of the specific parameters are listed in the table below.
Fig. 4 shows the network structure of the extreme learning machine, a simple single-hidden-layer feed-forward neural network. Fig. 5 shows the structural framework of the intrinsic-motivation-based autonomous development model, in which the training/storage network is a traditional self-organizing neural network.
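What makes the single-hidden-layer network of Fig. 4 an *extreme* learning machine is that only the output weights are trained, in one shot. A minimal sketch of the standard ELM recipe (layer sizes and the sigmoid activation are illustrative, not the patent's exact configuration):

```python
import numpy as np

def elm_train(X, T, n_hidden=20, seed=0):
    """Minimal ELM: random fixed input weights and biases, sigmoid hidden
    layer, output weights solved in closed form via the Moore-Penrose
    pseudoinverse (least-squares fit of H @ beta = T)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))   # random input weights (never trained)
    b = rng.standard_normal(n_hidden)                 # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))            # hidden-layer output matrix
    beta = np.linalg.pinv(H) @ T                      # output weights, one-shot solve
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```

Because fitting is a single pseudoinverse rather than iterative backpropagation, retraining on accumulated experience is cheap, which is the property the patent relies on for fast knowledge storage.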
Before carrying out any task, a two-wheeled robot must first ensure that it can stand upright, i.e. keep its balance. To verify the effectiveness and autonomy of the proposed intrinsically motivated extreme learning machine autonomous development model, the mathematical model of a two-wheeled robot is taken as the object of study, and its self-balancing skill in an unknown environment is investigated.
1. Experimental design
The control target of the experiment is for the two-wheeled self-balancing robot to achieve autonomous balance in an unknown environment under the intrinsically motivated extreme-learning-machine developmental mechanism. The action extreme learning machine (ELM) network takes four state inputs, namely the robot's inclination angle, body angular velocity, displacement and body velocity, and outputs the control quantity of the two-wheeled robot system. The evaluation network takes as inputs the four robot states together with the control quantity output by the action ELM network, and outputs the evaluation function V and the change in body force. The evaluation function is expressed through the reward return value: when the robot's inclination angle remains within the specified range, the system obtains a reward value, and otherwise a penalty. Appropriate discount and learning factors were chosen, along with the sampling time. In the training process, a run whose number of exploratory trials exceeds 200 before the robot sustains 15000 steps is regarded as a failed experiment; the test is then terminated and restarted. If within one trial the robot can keep from falling for 15000 steps, it is considered to have autonomously completed balance control in the unknown environment. After each failed exploration, the initial state and weight thresholds are reset to random numbers within a given range and training resumes. The results of 60 experiments show that, after an average of 65 failures, the robot achieves self-balancing control, exhibiting strong self-learning and adaptive ability. The simulation results are shown in Fig. 7.
2. Interpretation of results
To verify the effectiveness and convergence of the invention, simulation experiments on the motion-balance performance of the two-wheeled robot system were carried out and the experimental results analyzed.
Fig. 7 shows the curves over time of the four state quantities of the two-wheeled self-balancing robot after training by the proposed intrinsically motivated extreme learning machine autonomous development system; as seen from the figure, the robot completes self-balancing control after 3 s, i.e. 300 steps, demonstrating the invention's fast self-learning and adaptive ability.
Fig. 8 shows the evaluation-function and error curves of the system state quantities over 3000 steps of training, and Fig. 9 the curves of the force changes on the robot.
Fig. 10 compares, in simulation, the evaluation function of the proposed intrinsically motivated extreme learning machine autonomous development method (IM-Q-ELM) with that of a traditional reinforcement-learning algorithm (RL).
Fig. 11 compares the errors of the two algorithms. It can be seen that the proposed IM-Q-ELM method is far more robust than the latter. The experiments thus show that intrinsically motivated reinforcement learning, after training through the extreme learning machine network, attains better performance and faster learning and training speed, and likewise demonstrates the stronger adaptive and control ability of the two-wheeled self-balancing robot.
The present invention proposes an intrinsically motivated extreme learning machine autonomous development system and applies it to the balance control of a two-wheeled self-balancing robot. The reward mechanism of traditional reinforcement learning is replaced by an intrinsic-motivation mechanism, which determines the evaluation value; the extreme learning machine network replaces the conventional self-organizing neural network for training and for storing knowledge and experience, greatly increasing the speed at which the two-wheeled robot adapts to its environment and allowing it to learn self-balancing in an unknown environment within a short time.
Those skilled in the art can realize the present invention through various schemes without departing from its essence and spirit. The above describes only preferred feasible embodiments of the present invention and does not thereby limit its scope of rights; all equivalent structural changes made according to the description and drawings are contained within the scope of the present invention.
Claims (6)
1. An intrinsically motivated extreme learning machine autonomous development system, the system comprising an internal state set, an action set, a state transition function, an intrinsic-motivation orientation function, a reward signal, a reinforcement-learning update formula, an evaluation function and an action-selection probability; the cognitive model of the system taking reinforcement Q-learning combined with an extreme learning machine network as its framework, being driven by an intrinsic-motivation mechanism, and being designed as an eight-tuple model:

⟨S, A, T, O, R, Q, V, P⟩

wherein:
(1) S is the internal state set, S = {s1, s2, …, sn}, si denoting the i-th state and n the number of all states;
(2) A is the action set, A = {a1, a2, …, am}, aj denoting the j-th action and m the number of all actions;
(3) T is the state transition function: the external state s(t+1) at time t+1 is always determined jointly by the external state s(t) at time t and the agent action a(t), preferably by the system model together with the environment;
(4) O is the orientation function of the intrinsic motivation, determined by the system's evaluation function;
(5) R is the reward signal r(t), the reward obtained after the system, in state s(t) at time t, performs action a(t) and the system state transfers to s(t+1);
(6) Q is the reinforcement-learning update formula, i.e. the value function obtained after the system, in external state s(t) at time t, performs agent action a(t) and transfers to state s(t+1);
(7) V is the evaluation function;
(8) P is the action-selection probability P(a|s), the probability of selecting action a in state s.
2. The intrinsically motivated extreme learning machine autonomous development system according to claim 1, characterized in that, in the state transition function, the state transition equation determined by the state-transfer unit is:

s(t+1) = T(s(t), a(t))

the external state s(t+1) at time t+1 being always determined by the external state s(t) and the agent action a(t) at time t, and being independent of the external states and agent actions before time t.
3. The intrinsically motivated extreme learning machine autonomous development system according to claim 1, characterized in that the orientation function of the intrinsic motivation depends on an orientation parameter: the smaller its value, the smaller the corresponding action reward and the weaker the intrinsic-motivation orientation in the system; conversely, the larger its value, the larger the corresponding action reward and the stronger the intrinsic-motivation orientation.
4. The intrinsically motivated extreme learning machine autonomous development system according to claim 1, characterized in that the reward signal iterates the action-value function of the Markov decision process according to the temporal-difference (TD) algorithm of classical Q-learning, the iterative formula being:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

where α is the learning factor;
the reward signal being replaced by the intrinsically driven reward signal:

r_t = λ1 r_i(t) + λ2 r_e(t)

where r_i is the intrinsic-motivation function, r_e is the external-motivation function, and λ1 and λ2 are their respective weights, the iterative formula becoming:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ λ1 r_i(t) + λ2 r_e(t) + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

the reinforcement Q-learning algorithm being modeled as a Markov decision process and iterating toward the optimal solution:

Q*(s, a) = E[ r + γ max_{a'} Q*(s', a') ]

where γ is the discount factor, α is the learning factor, and 0 < α, γ < 1;
the flow of the reinforcement Q-learning algorithm being as follows:
Step 1: randomly initialize the Q-values;
Step 2: observe the current state s(t) and select an action decision a(t) to execute;
Step 3: obtain the next state s(t+1) and, at the same time, the reward signal r(t);
Step 4: update the Q-value according to the iterative formula above.
5. The intrinsically motivated extreme learning machine autonomous development system according to claim 1, characterized in that the evaluation function is defined as:

V(t) = Σ_{k≥0} γ^k r(t+k)

where γ is the discount factor, the evaluation function at time t satisfying:

V(t) = r(t) + γ V(t+1)

and, treating V as an observer, the TD error formula being established as:

δ(t) = r(t) + γ V(t+1) − V(t)

the action-selection probability being a random probability.
6. An operating method of the intrinsically motivated extreme learning machine autonomous development system according to any one of claims 1 to 5, comprising the following steps:
Step 1: initialize the current system state, choose the discount factor γ and learning factor α, and choose suitable weights λ1 and λ2 for the intrinsic-motivation and external-motivation functions;
Step 2: calculate the Q-values of all actions that may be taken in reinforcement learning;
Step 3: select a suitable action according to the Q-values;
Step 4: execute the current action and make a decision for the next stage of learning;
Step 5: calculate the intrinsic-motivation function while computing the optimal action decision according to reinforcement learning;
Step 6: update the Q-value according to the iterative formula;
Step 7: update the current time t and the current state s(t);
Step 8: repeat Steps 2 to 7 until training is finished.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611182422.0A CN106598058A (en) | 2016-12-20 | 2016-12-20 | Intrinsically motivated extreme learning machine autonomous development system and operating method thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106598058A true CN106598058A (en) | 2017-04-26 |
Family
ID=58599742
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611182422.0A Pending CN106598058A (en) | 2016-12-20 | 2016-12-20 | Intrinsically motivated extreme learning machine autonomous development system and operating method thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106598058A (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107220540A (en) * | 2017-04-19 | 2017-09-29 | 南京邮电大学 | Intrusion detection method based on intensified learning |
WO2018205778A1 (en) * | 2017-05-11 | 2018-11-15 | 苏州大学张家港工业技术研究院 | Large-range monitoring method based on deep weighted double-q learning and monitoring robot |
CN109195207A (en) * | 2018-07-19 | 2019-01-11 | 浙江工业大学 | A kind of energy-collecting type wireless relay network througput maximization approach based on deeply study |
CN109212975A (en) * | 2018-11-13 | 2019-01-15 | 北方工业大学 | A kind of perception action cognitive learning method with developmental mechanism |
CN109243021A (en) * | 2018-08-28 | 2019-01-18 | 余利 | Deeply learning type intelligent door lock system and device based on user experience analysis |
CN110070185A (en) * | 2019-04-09 | 2019-07-30 | 中国海洋大学 | A method of feedback, which is assessed, from demonstration and the mankind interacts intensified learning |
CN110244561A (en) * | 2019-06-11 | 2019-09-17 | 湘潭大学 | A kind of double inverted pendulum adaptive sliding-mode observer method based on interference observer |
CN110658785A (en) * | 2018-06-28 | 2020-01-07 | 发那科株式会社 | Output device, control device, and method for outputting evaluation function value |
CN110687802A (en) * | 2018-07-06 | 2020-01-14 | 珠海格力电器股份有限公司 | Intelligent household electrical appliance control method and intelligent household electrical appliance control device |
CN114065137A (en) * | 2021-12-17 | 2022-02-18 | 桂林电子科技大学 | Unmanned bicycle mass load eccentricity automatic identification method based on cognitive learning |
2016-12-20 | CN CN201611182422.0A patent CN106598058A (en) | active, Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160086087A1 (en) * | 2014-09-19 | 2016-03-24 | King Fahd University Of Petroleum And Minerals | Method for fast prediction of gas composition |
CN104992059A (en) * | 2015-06-24 | 2015-10-21 | 天津职业技术师范大学 | Intrinsic motivation based self-cognition system for motion balance robot and control method |
CN104914870A (en) * | 2015-07-08 | 2015-09-16 | 中南大学 | Ridge-regression-extreme-learning-machine-based local path planning method for outdoor robot |
CN105205533A (en) * | 2015-09-29 | 2015-12-30 | 华北理工大学 | Developmental automaton with a brain cognition mechanism and its learning method |
CN105700526A (en) * | 2016-01-13 | 2016-06-22 | 华北理工大学 | Online sequential extreme learning machine method with autonomous learning capability |
Non-Patent Citations (2)
Title |
---|
HONGGE REN et al.: "Research on Q-ELM Algorithm in Robot Path Planning", 2016 Chinese Control and Decision Conference * |
HONGGE REN et al.: "Research on Two-wheeled Self-balance Robot Based on IM-Q-ELM Algorithm", ICIC Express Letters, Part B: Applications * |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107220540A (en) * | 2017-04-19 | 2017-09-29 | 南京邮电大学 | Intrusion detection method based on reinforcement learning |
WO2018205778A1 (en) * | 2017-05-11 | 2018-11-15 | 苏州大学张家港工业技术研究院 | Large-range monitoring method based on deep weighted double-q learning and monitoring robot |
US11224970B2 (en) | 2017-05-11 | 2022-01-18 | Soochow University | Large area surveillance method and surveillance robot based on weighted double deep Q-learning |
CN110658785B (en) * | 2018-06-28 | 2024-03-08 | 发那科株式会社 | Output device, control device, and method for outputting evaluation function value |
CN110658785A (en) * | 2018-06-28 | 2020-01-07 | 发那科株式会社 | Output device, control device, and method for outputting evaluation function value |
CN110687802A (en) * | 2018-07-06 | 2020-01-14 | 珠海格力电器股份有限公司 | Intelligent household electrical appliance control method and intelligent household electrical appliance control device |
CN109195207B (en) * | 2018-07-19 | 2021-05-18 | 浙江工业大学 | Energy-collecting wireless relay network throughput maximization method based on deep reinforcement learning |
CN109195207A (en) * | 2018-07-19 | 2019-01-11 | 浙江工业大学 | Energy-harvesting wireless relay network throughput maximization method based on deep reinforcement learning |
CN109243021A (en) * | 2018-08-28 | 2019-01-18 | 余利 | Deep reinforcement learning intelligent door lock system and device based on user experience analysis |
CN109212975A (en) * | 2018-11-13 | 2019-01-15 | 北方工业大学 | Perception-action cognitive learning method with a developmental mechanism |
CN110070185A (en) * | 2019-04-09 | 2019-07-30 | 中国海洋大学 | Interactive reinforcement learning method based on demonstrations and human evaluative feedback |
CN110244561A (en) * | 2019-06-11 | 2019-09-17 | 湘潭大学 | Double inverted pendulum adaptive sliding-mode control method based on disturbance observer |
CN110244561B (en) * | 2019-06-11 | 2022-11-08 | 湘潭大学 | Secondary inverted pendulum self-adaptive sliding mode control method based on disturbance observer |
CN114065137A (en) * | 2021-12-17 | 2022-02-18 | 桂林电子科技大学 | Unmanned bicycle mass load eccentricity automatic identification method based on cognitive learning |
CN114065137B (en) * | 2021-12-17 | 2024-03-29 | 桂林电子科技大学 | Automatic recognition method for mass load eccentricity of unmanned bicycle based on cognitive learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106598058A (en) | Intrinsically motivated extreme learning machine autonomous development system and operating method thereof | |
Eysenbach et al. | Diversity is all you need: Learning skills without a reward function | |
Doncieux et al. | Beyond black-box optimization: a review of selective pressures for evolutionary robotics | |
Such et al. | Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning | |
Nelson et al. | Fitness functions in evolutionary robotics: A survey and analysis | |
CN105700526A (en) | Online sequential extreme learning machine method with autonomous learning capability | |
CN105205533A (en) | Developmental automaton with a brain cognition mechanism and its learning method | |
Showalter et al. | Neuromodulated multiobjective evolutionary neurocontrollers without speciation | |
Yan et al. | Path Planning for Mobile Robot's Continuous Action Space Based on Deep Reinforcement Learning | |
Cahill | Catastrophic forgetting in reinforcement-learning environments | |
CN116306947A (en) | Multi-agent decision method based on Monte Carlo tree exploration | |
CN116841303A (en) | Intelligent preferential high-order iterative self-learning control method for underwater robot | |
Patle | Intelligent navigational strategies for multiple wheeled mobile robots using artificial hybrid methodologies | |
Showalter et al. | Lamarckian inheritance in neuromodulated multiobjective evolutionary neurocontrollers | |
Cheng et al. | An autonomous inter-task mapping learning method via artificial neural network for transfer learning | |
Lehman et al. | Investigating biological assumptions through radical reimplementation | |
Tang et al. | Reinforcement learning for robots path planning with rule-based shallow-trial | |
Menon et al. | An Efficient Application of Neuroevolution for Competitive Multiagent Learning | |
Zhao et al. | Variational Diversity Maximization for Hierarchical Skill Discovery | |
Kumar et al. | A Novel Algorithm for Optimal Trajectory Generation Using Q Learning | |
Oudeyer | Interactive learning gives the tempo to an intrinsically motivated robot learner | |
Kovalský et al. | Evaluating the performance of a neuroevolution algorithm against a reinforcement learning algorithm on a self-driving car | |
Pagliuca | Efficient Evolution of Neural Networks | |
Showalter | Evolution of Multiobjective Neuromodulated Neurocontrollers for Multi-Robot Systems | |
Ni et al. | A Novel Heuristic Exploration Method Based on Action Effectiveness Constraints to Relieve Loop Enhancement Effect in Reinforcement Learning with Sparse Rewards |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 2017-04-26 |