CN104932264A - Humanoid robot stable control method of RBF-Q learning frame - Google Patents

Humanoid robot stable control method of RBF-Q learning frame

Info

Publication number
CN104932264A
CN104932264A (application CN201510299823.3A)
Authority
CN
China
Prior art keywords
rbf
pitch
learning
ankle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510299823.3A
Other languages
Chinese (zh)
Other versions
CN104932264B (en)
Inventor
毕盛
黄铨雍
韦如明
闵华清
董敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201510299823.3A priority Critical patent/CN104932264B/en
Publication of CN104932264A publication Critical patent/CN104932264A/en
Application granted granted Critical
Publication of CN104932264B publication Critical patent/CN104932264B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Feedback Control In General (AREA)
  • Manipulator (AREA)

Abstract

The invention discloses a humanoid robot stability control method based on an RBF-Q learning framework. The method comprises the following steps: an RBF-Q learning framework is proposed that addresses the problems of continuous state spaces and continuous action spaces in Q-learning; an online action-adjustment stability control algorithm based on RBF-Q learning is proposed, which generates trajectories for the hip, knee and ankle joints of the support leg and, by computing the angles of the remaining joints, controls the humanoid robot to walk stably; finally, the feasibility and validity of the RBF-Q learning framework method are verified on the Vitruvian Man humanoid robot platform designed by the laboratory. The method can generate a stable walking gait for the humanoid robot through online learning.

Description

Humanoid robot stability control method based on a Q-learning framework with an RBF network
Technical field
The present invention relates to the field of humanoid robot walking stability control, and specifically to a humanoid robot stability control method based on a Q-learning framework with an RBF network (RBF-Q Learning).
Background technology
Research on biped walking control for humanoid robot platforms is, in essence, the study of a complex control problem. Complex problems are usually solved by modelling the whole system and solving the system equations. In practice, however, we often encounter problems that are difficult to describe with an accurate model, or whose governing parameters are so numerous and complex that solving the system equations is infeasible. In such cases the problem can be tackled by learning rather than by building a detailed model.
The control of humanoid biped walking is highly unstable and strongly nonlinear, which makes it difficult to obtain a satisfactory solution through accurate modelling. Reinforcement learning and neural network methods have proven effective for complex control problems. These methods do not require the system designer to have a deep and precise understanding of the system dynamics. Through learning, they may provide solutions beyond the designer's own knowledge. Moreover, such methods can learn and improve continuously, much as animals in nature acquire most of their abilities through learning and adaptation.
Summary of the invention
The present invention takes walking stability of a humanoid robot on complex terrain as its research goal. To address the difficulty that reinforcement-learning Q-learning has with continuous state spaces and continuous action spaces, a Q-learning framework based on an RBF network (RBF-Q Learning) is proposed, and this framework is used to design and implement a humanoid robot walking stability control method. The validity of the method is finally verified in simulation and on a physical robot.
The invention provides a humanoid robot stability control method based on a Q-learning framework with an RBF network, which enables the humanoid robot to learn online and produce a stable gait plan, thereby achieving stable walking. The method comprises the following steps:
(1) Design of the Q-learning framework based on an RBF network (RBF-Q Learning).
The present invention designs a Q-learning framework for continuous spaces based on an RBF network. The framework uses an RBF network, which has strong global approximation capability, to fit the Q-function, and uses gradient descent to find the maximum value and the optimal action in each iteration. The algorithm can adjust and learn the RBF network structure and parameters online in real time according to the complexity of the problem, and has good generalization ability.
Combining the RBF network with Q-learning, the present invention designs the RBF-Q Learning algorithm framework, in which the RBF network approximates the Q-function used in Q-learning. Assume that the Q-function receives a state vector s(t) and an action vector a(t) as input and outputs a scalar Q(t).
1) RBF neural network design
Input layer: s(t) denotes the state input to the Q-function at time t; a(t) denotes the action input to the Q-function at time t.
Hidden layer: y_i(t) is the hidden-layer RBF activation function; a Gaussian kernel is used as the RBF activation function of each neuron. The output of the i-th neuron's RBF activation function is computed as:
y_i(t) = exp( −‖x(t) − μ_i(t)‖² / σ_i²(t) ),  i = 1, 2, …, k
where x is the input variable, μ_i and σ_i are the centre and standard deviation of the i-th neuron, and k is the number of RBF activation functions.
Output layer: Q(t) denotes the Q-function output, updated by the following formula:
Q(t) = Σ_{i=1}^{k} w_i(t)·y_i(t)
where w_i is the weight of the i-th neuron's output in the Q-function.
2) RBF network update
Define the Q-learning error δ_Q as:
δ_Q = (1 − λ)(r + γ·Q_max − Q(s, a*, t))
where λ is the learning factor (0 ≤ λ ≤ 1); γ is the discount factor (0 < γ < 1); Q_max is the current maximum Q value in the iterative process; r is the immediate reward; a* denotes the selected optimal action; s is the input state. The error δ_Q indicates the degree of convergence of the Q-function during learning. The learning performance index E of the system is defined as:
E(t) = (1/2)·δ_Q²(t)
The RBF network is updated with the BP algorithm and gradient descent. For each neuron's output weight w_i, the update formula is:
w_i(t+1) = w_i(t) + α_w·(∂E/∂w_i)
where α_w is the learning rate. For E(t) and w_i(t):
∂E/∂δ_Q = ∂((1/2)·δ_Q²)/∂δ_Q = δ_Q
∂δ_Q/∂w_i = y_i
By the chain rule, the update formula for each neuron's output weight w_i becomes:
w_i(t+1) = w_i(t) + α_w·δ_Q(t)·y_i(t)
For the centre μ_i and standard deviation σ_i of each neuron's RBF function, the update formulas are:
μ_i(t+1) = μ_i(t) + α_μ·δ_Q(t)·w_i(t)·y_i(t)·(x(t) − μ_i(t)) / σ_i²(t)
σ_i(t+1) = σ_i(t) + α_σ·δ_Q(t)·w_i(t)·y_i(t)·‖x(t) − μ_i(t)‖² / σ_i³(t)
where α_μ and α_σ are the learning rates of the RBF centres and standard deviations, respectively.
3) Finding the next action in Q-learning by gradient descent
In discrete Q-learning, max{Q(s(t), b, t) | b ∈ A} is found by traversing the Q table, where b denotes the next optimal action. For a Q-function over continuous actions, gradient descent is used to find the next action.
The maximization max{Q(s(t), b, t) | b ∈ A} can be converted into the minimization min{−Q(s(t), b, t) | b ∈ A}. Assume the current state is s(t); for the function −Q(s(t), b, t), the gradient direction is:
∇Q(a) = [ −∂Q(s(t), a, t)/∂a_1, …, −∂Q(s(t), a, t)/∂a_m ]^T
In each iteration, a is updated against this gradient:
a(i+1) = a(i) + λ_a·∇Q[a(i)]
where λ_a is the step size. The overall gradient-descent procedure for solving max{Q(s(t), b, t) | b ∈ A} is:
1. Initialize the parameters: allowable error ΔE_min, maximum number of iterations k, step size λ_a and a randomly assigned initial value a(0); set i = 0.
2. For a(i), use ∇Q(a) = [ −∂Q(s(t), a, t)/∂a_1, …, −∂Q(s(t), a, t)/∂a_m ]^T to compute the current gradient direction ∇Q[a(i)].
3. Use a(i+1) = a(i) + λ_a·∇Q[a(i)] to obtain a(i+1).
4. Compute the error ΔE = ‖a(i+1) − a(i)‖. If ΔE ≤ ΔE_min or i > k, stop; otherwise set i = i+1 and go to step 2.
(2) Design of the online action-adjustment stability controller based on the RBF-Q Learning framework
Two stability controllers are designed, one for the forward-backward (pitch) direction of the robot and one for the left-right (roll) direction:
1) Stability control in the forward-backward direction
For the left-foot support phase (the right-foot support phase is handled analogously), the state input of RBF-Q Learning for forward-backward stability control of the humanoid robot is defined as:
s_pitch(t) = [θ_hip_pitch(t), θ_knee_pitch(t), θ_ankle_pitch(t), θ_xz(t)]
where θ_hip_pitch(t), θ_knee_pitch(t) and θ_ankle_pitch(t) are the left-leg hip pitch servo angle, knee servo angle and ankle servo angle in the offline basic gait of the humanoid robot at time t, and θ_xz(t) is the angle between the torso and the vertical in the xz plane at time t.
Since stability in the forward-backward direction depends mainly on the left-leg hip pitch servo, knee servo and ankle servo, the output action is defined as their online adjustment values:
a_pitch(t) = [Δθ_hip_pitch(t), Δθ_knee_pitch(t), Δθ_ankle_pitch(t)]
where Δθ_hip_pitch(t), Δθ_knee_pitch(t) and Δθ_ankle_pitch(t) are the adjustment angles of the current hip pitch servo, knee servo and ankle servo, respectively.
To evaluate the action taken by the robot, the body deflection angle obtained from the attitude sensor is used to compute an immediate reward.
The immediate reward function of the forward-backward reinforcement-learning stability controller is defined as:
r_pitch(t) = [a_1, a_2]·[r_1(t), r_2(t)]^T
where a_1 and a_2 are the immediate-reward weights, and
r_1(t) = 0 if |θ_xz(t)| ≤ ε, otherwise −|θ_xz(t)|
r_2(t) = 0 if |Δθ_xz(t)| ≤ |Δθ_xz(t−1)|, otherwise −1
where ε is the allowable error band, and θ_xz(t) and Δθ_xz(t) are the torso-vertical angle in the xz plane at time t and its angular velocity, respectively. The immediate reward is intended to keep θ_xz(t) within the allowable error band while making its rate of change Δθ_xz(t) as small as possible.
2) Stability control in the left-right direction
Similarly, for left-right stability control of the humanoid robot, the state input of RBF-Q Learning is defined as:
s_roll(t) = [θ_hip_roll(t), θ_ankle_roll(t), θ_yz(t)]
where θ_hip_roll(t) and θ_ankle_roll(t) are the angles of the left-leg hip roll servo and ankle roll servo in the offline basic gait of the humanoid robot at time t, and θ_yz(t) is the angle between the torso and the vertical in the yz plane at time t.
Since stability in the left-right direction depends mainly on the left-leg hip roll servo and ankle roll servo, the output action is defined as:
a_roll(t) = [Δθ_hip_roll(t), Δθ_ankle_roll(t)]
where Δθ_hip_roll(t) and Δθ_ankle_roll(t) are the adjustment angles of the hip roll servo and ankle roll servo, respectively.
Using the torso-vertical angle in the yz plane and its angular velocity to evaluate stability in the left-right direction, the immediate reward function of the left-right reinforcement-learning stability controller is defined as:
r_roll(t) = [a_1, a_2]·[r_1(t), r_2(t)]^T
where a_1 and a_2 are the immediate-reward weights, and
r_1(t) = 0 if |θ_yz(t)| ≤ ε, otherwise −|θ_yz(t)|
r_2(t) = 0 if |Δθ_yz(t)| ≤ |Δθ_yz(t−1)|, otherwise −1
where ε is the allowable error band, and θ_yz(t) and Δθ_yz(t) are the torso-vertical angle in the yz plane at time t and its angular velocity, respectively. The immediate reward is intended to keep θ_yz(t) within the allowable error band while making its rate of change Δθ_yz(t) as small as possible.
Compared with the prior art, the present invention has the following advantages:
(1) The Q-learning framework based on an RBF network (RBF-Q Learning) optimizes the stability of the robot's walking and provides online learning ability. After a period of learning, the humanoid robot can walk stably across complex terrain.
(2) The control of humanoid biped walking is highly unstable and strongly nonlinear and is therefore difficult to model accurately. The RBF-Q Learning method does not require the system designer to have a deep and precise understanding of the system dynamics. Through learning, the method of the invention may provide solutions beyond the designer's own knowledge. At the same time, the method can learn and improve continuously, much as animals in nature acquire most of their abilities through learning and adaptation.
Description of the drawings
Fig. 1 shows the RBF-Q Learning network structure.
Fig. 2 is a flow diagram of the RBF-Q Learning algorithm framework.
Fig. 3 shows the angular-velocity curves of the robot walking uphill using online action-adjustment stability control (after 1000 walking steps; the upper curve is the robot's angular velocity about the y axis (pitch), and the lower curve is the humanoid robot's angular velocity about the x axis (roll)).
Fig. 4 shows the angular-velocity curves of the robot walking on rugged terrain using online action-adjustment stability control (after 1000 walking steps; the upper curve is the robot's angular velocity about the y axis (pitch), and the lower curve is the humanoid robot's angular velocity about the x axis (roll)).
Detailed description of the embodiments
The specific embodiments of the present invention are described in detail below with reference to the accompanying drawings, but the implementation and protection of the present invention are not limited thereto. Any symbols or procedures not described in detail below can be implemented by those skilled in the art with reference to the prior art.
(1) ZMP analysis is performed on a simplified humanoid robot model using a three-dimensional inverted pendulum model, and the centre-of-mass and foothold trajectories of the robot during the gait are computed. From the centre-of-mass and foothold trajectories, inverse kinematics analysis yields the motion trajectory of each joint of the humanoid robot during the gait, which is stored as the robot's offline basic gait.
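As a minimal Python illustration of the kind of inverted-pendulum relation used in such a ZMP analysis, the sketch below integrates the one-dimensional linear inverted pendulum equation of motion to obtain a centre-of-mass trajectory for a fixed ZMP reference; the function name, parameter values and constant-height assumption are illustrative only and are not prescribed by the patent.

import numpy as np

def lipm_com_trajectory(zmp_x, com_height=0.3, g=9.81, dt=0.01, steps=100, x0=0.0, v0=0.0):
    # Integrate the 1-D linear inverted pendulum dynamics x_ddot = (g / z_c) * (x - p_zmp)
    # for a constant ZMP reference p_zmp = zmp_x and constant CoM height z_c = com_height.
    omega2 = g / com_height          # squared natural frequency of the pendulum
    x, v = x0, v0                    # CoM position and velocity
    trajectory = []
    for _ in range(steps):
        a = omega2 * (x - zmp_x)     # LIPM equation of motion
        v += a * dt
        x += v * dt
        trajectory.append(x)
    return np.array(trajectory)

# Example: CoM evolution for a ZMP reference held 0.05 m ahead of the initial CoM position
com_x = lipm_com_trajectory(zmp_x=0.05)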
(2) Design of the Q-learning framework based on an RBF network (RBF-Q Learning).
1) Fitting the Q-function with an RBF network
An RBF network is used to approximate the Q-function in Q-learning. Assume that the Q-function receives a state vector s(t) and an action vector a(t) as input and outputs a scalar Q(t). The RBF neural network is structured as follows (see Fig. 1).
Input layer: s(t) denotes the state input to the Q-function at time t, of dimension n; a(t) denotes the action input to the Q-function at time t, of dimension m.
Hidden layer: y(t) are the hidden-layer RBF activation functions, k in total. A Gaussian kernel is used as the RBF activation function of each neuron. The output of the i-th neuron's RBF activation function is computed as:
y_i(t) = exp( −‖x(t) − μ_i(t)‖² / σ_i²(t) ),  i = 1, 2, …, k
where x is the input variable, μ_i and σ_i are the centre and standard deviation of the i-th neuron, and k is the number of RBF activation functions.
Output layer: Q(t) denotes the Q-function output, updated by the following formula:
Q(t) = Σ_{i=1}^{k} w_i(t)·y_i(t)
where w_i is the weight of the i-th neuron's output in the Q-function.
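The following Python sketch illustrates this network structure under the assumptions above (Gaussian hidden units over the concatenated state-action input and a linear output layer); the class name RBFQNetwork, the random initialisation of the centres and the default number of neurons are illustrative choices, not taken from the patent.

import numpy as np

class RBFQNetwork:
    def __init__(self, n_state, n_action, k=10, seed=0):
        rng = np.random.default_rng(seed)
        self.k = k                                          # number of RBF neurons
        self.mu = rng.normal(size=(k, n_state + n_action))  # centres mu_i
        self.sigma = np.ones(k)                             # standard deviations sigma_i
        self.w = np.zeros(k)                                # output weights w_i

    def activations(self, s, a):
        # Hidden layer: y_i(t) = exp(-||x(t) - mu_i(t)||^2 / sigma_i^2(t)), with x = [s, a]
        x = np.concatenate([s, a])
        d2 = np.sum((x - self.mu) ** 2, axis=1)
        return np.exp(-d2 / self.sigma ** 2)

    def q_value(self, s, a):
        # Output layer: Q(t) = sum_i w_i(t) * y_i(t)
        return float(self.w @ self.activations(s, a))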
For the update of this RBF network, define the Q-learning error δ_Q as:
δ_Q = (1 − λ)(r + γ·Q_max − Q(s, a*, t))
where λ is the learning factor (0 ≤ λ ≤ 1); γ is the discount factor (0 < γ < 1); Q_max is the current maximum Q value in the iterative process; r is the immediate reward; a* denotes the selected optimal action; s is the input state. The error δ_Q indicates the degree of convergence of the Q-function during learning. The learning performance index E of the system is defined as:
E(t) = (1/2)·δ_Q²(t)
The RBF network is updated with the BP algorithm and gradient descent. For each neuron's output weight w_i, the update formula is:
w_i(t+1) = w_i(t) + α_w·(∂E/∂w_i)
where α_w is the learning rate. For E(t) and w_i(t):
∂E/∂δ_Q = ∂((1/2)·δ_Q²)/∂δ_Q = δ_Q
∂δ_Q/∂w_i = y_i
By the chain rule, the update formula for each neuron's output weight w_i becomes:
w_i(t+1) = w_i(t) + α_w·δ_Q(t)·y_i(t)
For the centre μ_i and standard deviation σ_i of each neuron's RBF function, the update formulas are:
μ_i(t+1) = μ_i(t) + α_μ·δ_Q(t)·w_i(t)·y_i(t)·(x(t) − μ_i(t)) / σ_i²(t)
σ_i(t+1) = σ_i(t) + α_σ·δ_Q(t)·w_i(t)·y_i(t)·‖x(t) − μ_i(t)‖² / σ_i³(t)
where α_μ and α_σ are the learning rates of the RBF centres and standard deviations, respectively.
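A minimal sketch of these update formulas follows, assuming the RBFQNetwork class sketched above; the function name, the default learning rates and the default values of λ and γ are illustrative, and Q_max is simply supplied by the caller.

def rbf_q_update(net, s, a_star, r, q_max, lam=0.5, gamma=0.9,
                 alpha_w=0.05, alpha_mu=0.01, alpha_sigma=0.01):
    # delta_Q = (1 - lambda) * (r + gamma * Q_max - Q(s, a*, t))
    y = net.activations(s, a_star)
    x = np.concatenate([s, a_star])
    delta_q = (1 - lam) * (r + gamma * q_max - net.w @ y)
    diff = x - net.mu                        # x(t) - mu_i(t), one row per neuron
    d2 = np.sum(diff ** 2, axis=1)           # ||x(t) - mu_i(t)||^2
    coeff = delta_q * net.w * y              # delta_Q(t) * w_i(t) * y_i(t), all time-t values
    net.w += alpha_w * delta_q * y                                   # w_i update
    net.mu += alpha_mu * (coeff / net.sigma ** 2)[:, None] * diff    # mu_i update
    net.sigma += alpha_sigma * coeff * d2 / net.sigma ** 3           # sigma_i update
    return delta_q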
2) Finding the next action in Q-learning by gradient descent
In discrete Q-learning, max{Q(s(t), b, t) | b ∈ A} is found by traversing the Q table, where b denotes the next optimal action. For a Q-function over continuous actions, gradient descent is used to find the next action.
The maximization max{Q(s(t), b, t) | b ∈ A} can be converted into the minimization min{−Q(s(t), b, t) | b ∈ A}. Assume the current state is s(t); for the function −Q(s(t), b, t), the gradient direction is:
∇Q(a) = [ −∂Q(s(t), a, t)/∂a_1, …, −∂Q(s(t), a, t)/∂a_m ]^T
In each iteration, a is updated against this gradient:
a(i+1) = a(i) + λ_a·∇Q[a(i)]
where λ_a is the step size. The overall gradient-descent procedure for solving max{Q(s(t), b, t) | b ∈ A} is as follows (a sketch is given after the steps):
1. Initialize the parameters: allowable error ΔE_min, maximum number of iterations k, step size λ_a and a randomly assigned initial value a(0); set i = 0.
2. For a(i), use ∇Q(a) = [ −∂Q(s(t), a, t)/∂a_1, …, −∂Q(s(t), a, t)/∂a_m ]^T to compute the current gradient direction ∇Q[a(i)].
3. Use a(i+1) = a(i) + λ_a·∇Q[a(i)] to obtain a(i+1).
4. Compute the error ΔE = ‖a(i+1) − a(i)‖. If ΔE ≤ ΔE_min or i > k, stop; otherwise set i = i+1 and go to step 2.
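The sketch below implements this search, assuming the RBFQNetwork sketch given earlier; the gradient of Q with respect to the action is approximated by central finite differences, and the function name, step size and tolerances are illustrative values rather than values fixed by the patent.

def best_action(net, s, a0, step=0.01, max_iter=50, tol=1e-4, eps=1e-5):
    # Gradient-based search for the action maximizing Q(s, a), i.e. minimizing -Q(s, a).
    a = np.asarray(a0, dtype=float)
    for _ in range(max_iter):
        grad = np.zeros_like(a)
        for j in range(a.size):                  # finite-difference estimate of dQ/da_j
            da = np.zeros_like(a)
            da[j] = eps
            grad[j] = (net.q_value(s, a + da) - net.q_value(s, a - da)) / (2 * eps)
        a_next = a + step * grad                 # ascend Q (descend -Q)
        if np.linalg.norm(a_next - a) <= tol:    # stop when Delta_E <= Delta_E_min
            return a_next
        a = a_next
    return a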
Combining the RBF neural network and the gradient descent method gives the complete RBF-Q Learning algorithm framework; the algorithm flow is shown in Fig. 2.
(3) Design of the online action-adjustment stability controller based on the RBF-Q Learning framework
The state inputs and action outputs of the RBF-Q Learning framework for humanoid walking are designed as follows. Humanoid biped walking is a process in which two walking phases alternate (taking the case where the first step is taken with the right foot as an example): the left-foot support phase transforms into the right-foot support phase and the cycle repeats, usually with a short double-support phase interspersed between the two. In the left-foot support phase, the three-dimensional inverted pendulum formed with the left foot as support is stabilized mainly by the left-foot servos; at this time the robot's stability in the forward-backward direction is determined by the left-leg hip pitch servo, knee servo and ankle servo, and its stability in the left-right direction is determined by the left-leg hip roll servo and ankle roll servo. Likewise, in the right-foot support phase, forward-backward stability is determined by the right-leg hip pitch servo, knee servo and ankle servo, and left-right stability by the right-leg hip roll servo and ankle roll servo. Based on this structural feature, two stability controllers are designed, one for the forward-backward direction and one for the left-right direction:
1) Stability control in the forward-backward direction
For the left-foot support phase (the right-foot support phase is handled analogously), the state input of RBF-Q Learning for forward-backward stability control of the humanoid robot is defined as:
s_pitch(t) = [θ_hip_pitch(t), θ_knee_pitch(t), θ_ankle_pitch(t), θ_xz(t)]
where θ_hip_pitch(t), θ_knee_pitch(t) and θ_ankle_pitch(t) are the left-leg hip pitch servo angle, knee servo angle and ankle servo angle in the offline basic gait of the humanoid robot at time t, and θ_xz(t) is the angle between the torso and the vertical in the xz plane at time t.
Since stability in the forward-backward direction depends mainly on the left-leg hip pitch servo, knee servo and ankle servo, the output action is defined as their online adjustment values:
a_pitch(t) = [Δθ_hip_pitch(t), Δθ_knee_pitch(t), Δθ_ankle_pitch(t)]
where Δθ_hip_pitch(t), Δθ_knee_pitch(t) and Δθ_ankle_pitch(t) are the adjustment angles of the current hip pitch servo, knee servo and ankle servo, respectively.
To evaluate the action taken by the robot, the body deflection angle obtained from the attitude sensor is used to compute an immediate reward. The immediate reward function of the forward-backward reinforcement-learning stability controller is defined as:
r_pitch(t) = [a_1, a_2]·[r_1(t), r_2(t)]^T
where a_1 and a_2 are the immediate-reward weights, and
r_1(t) = 0 if |θ_xz(t)| ≤ ε, otherwise −|θ_xz(t)|
r_2(t) = 0 if |Δθ_xz(t)| ≤ |Δθ_xz(t−1)|, otherwise −1
where ε is the allowable error band, and θ_xz(t) and Δθ_xz(t) are the torso-vertical angle in the xz plane at time t and its angular velocity, respectively. The immediate reward is intended to keep θ_xz(t) within the allowable error band while making its rate of change Δθ_xz(t) as small as possible.
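A minimal sketch of this immediate reward is given below; the weights a1, a2 and the tolerance band eps take illustrative values (the patent does not fix them), and the same form applies to the left-right controller with θ_yz in place of θ_xz.

def pitch_reward(theta_xz, d_theta_xz, d_theta_xz_prev, a1=1.0, a2=1.0, eps=0.02):
    # r1 penalizes the torso-vertical angle once it leaves the allowable error band
    r1 = 0.0 if abs(theta_xz) <= eps else -abs(theta_xz)
    # r2 penalizes an increasing rate of change of the torso-vertical angle
    r2 = 0.0 if abs(d_theta_xz) <= abs(d_theta_xz_prev) else -1.0
    return a1 * r1 + a2 * r2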
2) Stability control in the left-right direction
Similarly, for left-right stability control of the humanoid robot, the state input of RBF-Q Learning is defined as:
s_roll(t) = [θ_hip_roll(t), θ_ankle_roll(t), θ_yz(t)]
where θ_hip_roll(t) and θ_ankle_roll(t) are the angles of the left-leg hip roll servo and ankle roll servo in the offline basic gait of the humanoid robot at time t, and θ_yz(t) is the angle between the torso and the vertical in the yz plane at time t.
Since stability in the left-right direction depends mainly on the left-leg hip roll servo and ankle roll servo, the output action is defined as:
a_roll(t) = [Δθ_hip_roll(t), Δθ_ankle_roll(t)]
where Δθ_hip_roll(t) and Δθ_ankle_roll(t) are the adjustment angles of the hip roll servo and ankle roll servo, respectively.
Using the torso-vertical angle in the yz plane and its angular velocity to evaluate stability in the left-right direction, the immediate reward function of the left-right reinforcement-learning stability controller is defined as:
r_roll(t) = [a_1, a_2]·[r_1(t), r_2(t)]^T
where a_1 and a_2 are the immediate-reward weights, and
r_1(t) = 0 if |θ_yz(t)| ≤ ε, otherwise −|θ_yz(t)|
r_2(t) = 0 if |Δθ_yz(t)| ≤ |Δθ_yz(t−1)|, otherwise −1
where ε is the allowable error band, and θ_yz(t) and Δθ_yz(t) are the torso-vertical angle in the yz plane at time t and its angular velocity, respectively. The immediate reward is intended to keep θ_yz(t) within the allowable error band while making its rate of change Δθ_yz(t) as small as possible.
3) Online action-adjustment stability control flow based on the RBF-Q Learning framework
For every action about to be executed during the humanoid robot's gait, the stability controller obtains sensor information from the Kalman filtering algorithm and computes the current state from the current offline basic gait. Following the flow of Fig. 2, it updates the RBF-Q Learning framework, obtains the next action, and corrects the action about to be executed in real time.
In summary, each RBF-Q Learning online action-adjustment stability controller runs the following algorithm steps (a sketch of this loop is given after the steps):
1. Initialize the RBF-Q Learning framework.
2. For each walking action about to be executed, obtain the torso-vertical angle and its angular velocity from the Kalman filter fusion algorithm, and compute the current state according to the formulas above.
3. Using the current state, compute the optimal action according to the RBF-Q Learning framework.
4. Use the optimal action obtained in step 3 to correct the next walking action.
5. Execute the next action, obtain the system's immediate reward, and update the RBF-Q Learning framework. Go to step 2.
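The sketch below ties these steps together, assuming the RBFQNetwork, rbf_q_update and best_action sketches given earlier. Because the patent does not specify the robot interface, the sensor reading, servo command and reward computation are passed in as caller-supplied callbacks (read_state, apply_action, reward_fn), and using the Q value of the chosen action as Q_max is a simplification of the framework's bookkeeping.

def online_stability_loop(net, offline_gait, read_state, apply_action, reward_fn, n_adjust=3):
    a = np.zeros(n_adjust)                       # step 1: start from zero joint adjustments
    for base_action in offline_gait:
        s = read_state(base_action)              # step 2: current state from Kalman-filtered sensors
        a = best_action(net, s, a)               # step 3: optimal adjustment from RBF-Q Learning
        apply_action(base_action, a)             # step 4: correct and execute the walking action
        r = reward_fn()                          # step 5: immediate reward after execution
        q_max = net.q_value(s, a)                # simplification: current best Q estimate
        rbf_q_update(net, s, a, r, q_max)        # update the RBF-Q Learning framework
    return net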
(4) Experimental tests and analysis of results
1) Simulation tests and analysis of results
The online action-adjustment stability controller based on the RBF-Q Learning framework was used for online stability control of humanoid walking. The humanoid robot learned in a simulation environment, continuously adapting and modifying the basic gait until it completed the continuous-walking objective.
In this set of experiments the algorithm converged after 1000 walking steps and completed 10 consecutive steps on uphill terrain and on rugged terrain, respectively. The experimental results show that, after a period of learning, a humanoid robot using online action-adjustment stability control based on the RBF-Q Learning framework is able to walk across complex terrain such as slopes and rugged ground.
Fig. 3 shows the real-time change of angular velocity while the humanoid robot, using online action-adjustment stability control based on the RBF-Q Learning framework, walks uphill. The data were recorded during the 1000th walk of the learning process, in which the robot successfully walked 10 consecutive steps on uphill terrain.
Fig. 4 shows the real-time change of angular velocity while the humanoid robot, using online action-adjustment stability control based on the RBF-Q Learning framework, walks on rugged terrain. The data were recorded during the 1000th walk of the learning process, in which the robot successfully walked 10 consecutive steps on rugged terrain.
2) Physical robot tests
In real-world experiments, the online action-adjustment stability control based on the RBF-Q Learning framework was successfully applied to the humanoid robot platform and enabled it to complete the walking task, thereby verifying the validity of the humanoid robot stability control method based on the RBF-Q Learning framework proposed by the present invention.

Claims (1)

1. A humanoid robot stability control method based on a Q-learning framework with an RBF network, characterized by comprising the following steps:
(1) designing the Q-learning framework based on an RBF network (RBF-Q Learning), assuming that the Q-function receives a state vector s(t) and an action vector a(t) as input and outputs a scalar Q(t), specifically comprising:
1) RBF neural network design
Input layer: s(t) denotes the state input to the Q-function at time t; a(t) denotes the action input to the Q-function at time t;
Hidden layer: y_i(t) is the hidden-layer RBF activation function; a Gaussian kernel is used as the RBF activation function of each neuron; the output of the i-th neuron's RBF activation function is computed as:
y_i(t) = exp( −‖x(t) − μ_i(t)‖² / σ_i²(t) ),  i = 1, 2, …, k
where x is the input variable, μ_i and σ_i are the centre and standard deviation of the i-th neuron, and k is the number of RBF activation functions;
Output layer: Q(t) denotes the Q-function output, updated by the following formula:
Q(t) = Σ_{i=1}^{k} w_i(t)·y_i(t)
where w_i is the weight of the i-th neuron's output in the Q-function;
2) RBF network update
Define the Q-learning error δ_Q as:
δ_Q = (1 − λ)(r + γ·Q_max − Q(s, a*, t))
where λ is the learning factor, 0 ≤ λ ≤ 1; γ is the discount factor, 0 < γ < 1; Q_max is the current maximum Q value in the iterative process; r is the immediate reward; a* denotes the selected optimal action; s is the input state; the error δ_Q indicates the degree of convergence of the Q-function during learning; the learning performance index E of the system is defined as:
E(t) = (1/2)·δ_Q²(t)
The RBF network is updated with the BP algorithm and gradient descent; for each neuron's output weight w_i, the update formula is:
w_i(t+1) = w_i(t) + α_w·(∂E/∂w_i)
where α_w is the learning rate; for E(t) and w_i(t):
∂E/∂δ_Q = ∂((1/2)·δ_Q²)/∂δ_Q = δ_Q
∂δ_Q/∂w_i = y_i
By the chain rule, the update formula for each neuron's output weight w_i becomes:
w_i(t+1) = w_i(t) + α_w·δ_Q(t)·y_i(t)
For the centre μ_i and standard deviation σ_i of each neuron's RBF function, the update formulas are:
μ_i(t+1) = μ_i(t) + α_μ·δ_Q(t)·w_i(t)·y_i(t)·(x(t) − μ_i(t)) / σ_i²(t)
σ_i(t+1) = σ_i(t) + α_σ·δ_Q(t)·w_i(t)·y_i(t)·‖x(t) − μ_i(t)‖² / σ_i³(t)
where α_μ and α_σ are the learning rates of the RBF centres and standard deviations, respectively;
3) Finding the next action in Q-learning by gradient descent
In discrete Q-learning, max{Q(s(t), b, t) | b ∈ A} is found by traversing the Q table, where b denotes the next optimal action; for a Q-function over continuous actions, gradient descent is used to find the next action;
The maximization max{Q(s(t), b, t) | b ∈ A} can be converted into the minimization min{−Q(s(t), b, t) | b ∈ A}; assuming the current state is s(t), for the function −Q(s(t), b, t) the gradient direction is:
∇Q(a) = [ −∂Q(s(t), a, t)/∂a_1, …, −∂Q(s(t), a, t)/∂a_m ]^T
In each iteration, a is updated against this gradient:
a(i+1) = a(i) + λ_a·∇Q[a(i)]
where λ_a is the step size; the overall gradient-descent procedure for solving max{Q(s(t), b, t) | b ∈ A} is:
1. Initialize the parameters: allowable error ΔE_min, maximum number of iterations k, step size λ_a and a randomly assigned initial value a(0); set i = 0;
2. For a(i), use ∇Q(a) = [ −∂Q(s(t), a, t)/∂a_1, …, −∂Q(s(t), a, t)/∂a_m ]^T to compute the current gradient direction ∇Q[a(i)];
3. Use a(i+1) = a(i) + λ_a·∇Q[a(i)] to obtain a(i+1);
4. Compute the error ΔE = ‖a(i+1) − a(i)‖; if ΔE ≤ ΔE_min or i > k, stop; otherwise set i = i+1 and go to step 2;
(2) designing the online action-adjustment stability controller based on the RBF-Q Learning framework; for the forward-backward and left-right directions of the robot, two stability controllers are designed respectively:
1) Stability control in the forward-backward direction
For the left-foot support phase (the right-foot support phase is handled analogously), the state input of RBF-Q Learning for forward-backward stability control of the humanoid robot is defined as:
s_pitch(t) = [θ_hip_pitch(t), θ_knee_pitch(t), θ_ankle_pitch(t), θ_xz(t)]
where θ_hip_pitch(t), θ_knee_pitch(t) and θ_ankle_pitch(t) are the left-leg hip pitch servo angle, knee servo angle and ankle servo angle in the offline basic gait of the humanoid robot at time t, and θ_xz(t) is the angle between the torso and the vertical in the xz plane at time t;
Since stability in the forward-backward direction depends mainly on the left-leg hip pitch servo, knee servo and ankle servo, the output action is defined as their online adjustment values:
a_pitch(t) = [Δθ_hip_pitch(t), Δθ_knee_pitch(t), Δθ_ankle_pitch(t)]
where Δθ_hip_pitch(t), Δθ_knee_pitch(t) and Δθ_ankle_pitch(t) are the adjustment angles of the current hip pitch servo, knee servo and ankle servo, respectively;
To evaluate the action taken by the robot, the body deflection angle obtained from the attitude sensor is used to compute an immediate reward;
The immediate reward function of the forward-backward reinforcement-learning stability controller is defined as:
r_pitch(t) = [a_1, a_2]·[r_1(t), r_2(t)]^T
where a_1 and a_2 are the immediate-reward weights, and
r_1(t) = 0 if |θ_xz(t)| ≤ ε, otherwise −|θ_xz(t)|
r_2(t) = 0 if |Δθ_xz(t)| ≤ |Δθ_xz(t−1)|, otherwise −1
where ε is the allowable error band, and θ_xz(t) and Δθ_xz(t) are the torso-vertical angle in the xz plane at time t and its angular velocity, respectively; the immediate reward is intended to keep θ_xz(t) within the allowable error band while making its rate of change Δθ_xz(t) as small as possible;
2) Stability control in the left-right direction
Similarly, for left-right stability control of the humanoid robot, the state input of RBF-Q Learning is defined as:
s_roll(t) = [θ_hip_roll(t), θ_ankle_roll(t), θ_yz(t)]
where θ_hip_roll(t) and θ_ankle_roll(t) are the angles of the left-leg hip roll servo and ankle roll servo in the offline basic gait of the humanoid robot at time t, and θ_yz(t) is the angle between the torso and the vertical in the yz plane at time t;
Since stability in the left-right direction depends mainly on the left-leg hip roll servo and ankle roll servo, the output action is defined as:
a_roll(t) = [Δθ_hip_roll(t), Δθ_ankle_roll(t)]
where Δθ_hip_roll(t) and Δθ_ankle_roll(t) are the adjustment angles of the hip roll servo and ankle roll servo, respectively;
Using the torso-vertical angle in the yz plane and its angular velocity to evaluate stability in the left-right direction, the immediate reward function of the left-right reinforcement-learning stability controller is defined as:
r_roll(t) = [a_1, a_2]·[r_1(t), r_2(t)]^T
where a_1 and a_2 are the immediate-reward weights, and
r_1(t) = 0 if |θ_yz(t)| ≤ ε, otherwise −|θ_yz(t)|
r_2(t) = 0 if |Δθ_yz(t)| ≤ |Δθ_yz(t−1)|, otherwise −1
where ε is the allowable error band, and θ_yz(t) and Δθ_yz(t) are the torso-vertical angle in the yz plane at time t and its angular velocity, respectively; the immediate reward is intended to keep θ_yz(t) within the allowable error band while making its rate of change Δθ_yz(t) as small as possible.
CN201510299823.3A 2015-06-03 2015-06-03 The apery robot stabilized control method of Q learning frameworks based on RBF networks Expired - Fee Related CN104932264B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510299823.3A CN104932264B (en) 2015-06-03 2015-06-03 The apery robot stabilized control method of Q learning frameworks based on RBF networks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510299823.3A CN104932264B (en) 2015-06-03 2015-06-03 The apery robot stabilized control method of Q learning frameworks based on RBF networks

Publications (2)

Publication Number Publication Date
CN104932264A true CN104932264A (en) 2015-09-23
CN104932264B CN104932264B (en) 2018-07-20

Family

ID=54119479

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510299823.3A Expired - Fee Related CN104932264B (en) 2015-06-03 2015-06-03 The apery robot stabilized control method of Q learning frameworks based on RBF networks

Country Status (1)

Country Link
CN (1) CN104932264B (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106094813A (en) * 2016-05-26 2016-11-09 华南理工大学 It is correlated with based on model humanoid robot gait's control method of intensified learning
CN106094817A (en) * 2016-06-14 2016-11-09 华南理工大学 Intensified learning humanoid robot gait's planing method based on big data mode
CN107292392A (en) * 2017-05-11 2017-10-24 苏州大学 Large-range monitoring method and supervisory-controlled robot based on the double Q study of depth cum rights
CN107292344A (en) * 2017-06-26 2017-10-24 苏州大学 Robot real-time control method based on environment interaction
CN107403049A (en) * 2017-07-31 2017-11-28 山东师范大学 A kind of Q Learning pedestrians evacuation emulation method and system based on artificial neural network
CN108051787A (en) * 2017-12-05 2018-05-18 上海无线电设备研究所 A kind of missile-borne radar flying test method
CN108537379A (en) * 2018-04-04 2018-09-14 北京科东电力控制系统有限责任公司 Adaptive variable weight combination load forecasting method and device
CN108631817A (en) * 2018-05-10 2018-10-09 东北大学 A method of Frequency Hopping Signal frequency range prediction is carried out based on time frequency analysis and radial neural network
CN108873687A (en) * 2018-07-11 2018-11-23 哈尔滨工程大学 A kind of Intelligent Underwater Robot behavior system knot planing method based on depth Q study
CN109348707A (en) * 2016-04-27 2019-02-15 纽拉拉股份有限公司 For the method and apparatus of the Q study trimming experience memory based on deep neural network
CN109827292A (en) * 2019-01-16 2019-05-31 珠海格力电器股份有限公司 Construction method, control method, the household electrical appliances of household electrical appliances adaptive power conservation Controlling model
CN110712201A (en) * 2019-09-20 2020-01-21 同济大学 Robot multi-joint self-adaptive compensation method based on perceptron model and stabilizer
WO2020199648A1 (en) * 2019-04-01 2020-10-08 珠海格力电器股份有限公司 Control method and device for air conditioner
CN113062601A (en) * 2021-03-17 2021-07-02 同济大学 Q learning-based concrete distributing robot trajectory planning method
CN113467235A (en) * 2021-06-10 2021-10-01 清华大学 Biped robot gait control method and control device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011065553A (en) * 2009-09-18 2011-03-31 Honda Motor Co Ltd Learning control system and learning control method
JP2011204036A (en) * 2010-03-25 2011-10-13 Institute Of National Colleges Of Technology Japan Experience reinforcement type reinforcement learning system, experience reinforcement type reinforcement learning method and experience reinforcement type reinforcement learning program
CN102402712A (en) * 2011-08-31 2012-04-04 山东大学 Robot reinforced learning initialization method based on neural network
CN103204193A (en) * 2013-04-08 2013-07-17 浙江大学 Under-actuated biped robot walking control method
CN103440366A (en) * 2013-08-05 2013-12-11 广东电网公司电力科学研究院 BP (Back Propagation) neural network-based exhaust dryness computing method of USC (Ultra-Supercritical) turbine
CN103605285A (en) * 2013-11-21 2014-02-26 南京理工大学 Fuzzy nerve network control method for automobile driving robot system
WO2014047142A1 (en) * 2012-09-20 2014-03-27 Brain Corporation Spiking neuron network adaptive control apparatus and methods

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011065553A (en) * 2009-09-18 2011-03-31 Honda Motor Co Ltd Learning control system and learning control method
JP2011204036A (en) * 2010-03-25 2011-10-13 Institute Of National Colleges Of Technology Japan Experience reinforcement type reinforcement learning system, experience reinforcement type reinforcement learning method and experience reinforcement type reinforcement learning program
CN102402712A (en) * 2011-08-31 2012-04-04 山东大学 Robot reinforced learning initialization method based on neural network
WO2014047142A1 (en) * 2012-09-20 2014-03-27 Brain Corporation Spiking neuron network adaptive control apparatus and methods
CN103204193A (en) * 2013-04-08 2013-07-17 浙江大学 Under-actuated biped robot walking control method
CN103440366A (en) * 2013-08-05 2013-12-11 广东电网公司电力科学研究院 BP (Back Propagation) neural network-based exhaust dryness computing method of USC (Ultra-Supercritical) turbine
CN103605285A (en) * 2013-11-21 2014-02-26 南京理工大学 Fuzzy nerve network control method for automobile driving robot system

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
刘道远: "基于Q学习的欠驱动双足机器人行走控制研究", 《中国优秀硕士学位论文全文数据库信息科技辑(月刊)》 *
吴洪岩,等: "基于RBFNN的强化学习在机器人导航中的应用", 《吉林大学学报(信息科学版)》 *
吴洪岩: "基于强化学习的自主移动机器人导航研究", 《中国优秀硕士学位论文全文数据库信息科技辑(月刊)》 *
尹俊明,等: "基于RBF-Q学习的四足机器人运动协调控制", 《计算机应用研究》 *
盛维涛: "基于激励学习算法的移动机器人避障规划研究盛维涛", 《中国优秀硕士学位论文全文数据库信息科技辑(月刊)》 *
葛媛,等: "模糊强化学习在机器人导航中的应用", 《信息技术》 *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109348707A (en) * 2016-04-27 2019-02-15 纽拉拉股份有限公司 For the method and apparatus of the Q study trimming experience memory based on deep neural network
CN106094813A (en) * 2016-05-26 2016-11-09 华南理工大学 It is correlated with based on model humanoid robot gait's control method of intensified learning
CN106094813B (en) * 2016-05-26 2019-01-18 华南理工大学 Humanoid robot gait's control method based on model correlation intensified learning
CN106094817A (en) * 2016-06-14 2016-11-09 华南理工大学 Intensified learning humanoid robot gait's planing method based on big data mode
CN107292392A (en) * 2017-05-11 2017-10-24 苏州大学 Large-range monitoring method and supervisory-controlled robot based on the double Q study of depth cum rights
CN107292392B (en) * 2017-05-11 2019-11-22 苏州大学 Large-range monitoring method and supervisory-controlled robot based on the double Q study of depth cum rights
CN107292344A (en) * 2017-06-26 2017-10-24 苏州大学 Robot real-time control method based on environment interaction
CN107292344B (en) * 2017-06-26 2020-09-18 苏州大学 Robot real-time control method based on environment interaction
CN107403049A (en) * 2017-07-31 2017-11-28 山东师范大学 A kind of Q Learning pedestrians evacuation emulation method and system based on artificial neural network
CN107403049B (en) * 2017-07-31 2019-03-19 山东师范大学 A kind of Q-Learning pedestrian's evacuation emulation method and system based on artificial neural network
CN108051787A (en) * 2017-12-05 2018-05-18 上海无线电设备研究所 A kind of missile-borne radar flying test method
CN108537379A (en) * 2018-04-04 2018-09-14 北京科东电力控制系统有限责任公司 Adaptive variable weight combination load forecasting method and device
CN108631817A (en) * 2018-05-10 2018-10-09 东北大学 A method of Frequency Hopping Signal frequency range prediction is carried out based on time frequency analysis and radial neural network
CN108631817B (en) * 2018-05-10 2020-05-19 东北大学 Method for predicting frequency hopping signal frequency band based on time-frequency analysis and radial neural network
CN108873687A (en) * 2018-07-11 2018-11-23 哈尔滨工程大学 A kind of Intelligent Underwater Robot behavior system knot planing method based on depth Q study
CN109827292A (en) * 2019-01-16 2019-05-31 珠海格力电器股份有限公司 Construction method, control method, the household electrical appliances of household electrical appliances adaptive power conservation Controlling model
WO2020199648A1 (en) * 2019-04-01 2020-10-08 珠海格力电器股份有限公司 Control method and device for air conditioner
US11965666B2 (en) 2019-04-01 2024-04-23 Gree Electric Appliances, Inc. Of Zhuhai Control method for air conditioner, and device for air conditioner and storage medium
CN110712201A (en) * 2019-09-20 2020-01-21 同济大学 Robot multi-joint self-adaptive compensation method based on perceptron model and stabilizer
CN110712201B (en) * 2019-09-20 2022-09-16 同济大学 Robot multi-joint self-adaptive compensation method based on perceptron model and stabilizer
CN113062601A (en) * 2021-03-17 2021-07-02 同济大学 Q learning-based concrete distributing robot trajectory planning method
CN113467235A (en) * 2021-06-10 2021-10-01 清华大学 Biped robot gait control method and control device
CN113467235B (en) * 2021-06-10 2022-09-02 清华大学 Biped robot gait control method and control device

Also Published As

Publication number Publication date
CN104932264B (en) 2018-07-20

Similar Documents

Publication Publication Date Title
CN104932264A (en) Humanoid robot stable control method of RBF-Q learning frame
EP2017042B1 (en) Motion controller and motion control method for legged walking robot, and robot apparatus
US8417382B2 (en) Control device for legged mobile body
Rodriguez et al. DeepWalk: Omnidirectional bipedal gait by deep reinforcement learning
US8306657B2 (en) Control device for legged mobile robot
US8311677B2 (en) Control device for legged mobile robot
KR101083414B1 (en) Controller of legged mobile robot
US20110022232A1 (en) Control device for mobile body
KR20010050543A (en) Ambulation control apparatus and ambulation control method of robot
KR20050021288A (en) Robot and attitude control method of robot
Pandala et al. Robust predictive control for quadrupedal locomotion: Learning to close the gap between reduced-and full-order models
Atmeh et al. Implementation of an adaptive, model free, learning controller on the Atlas robot
Ahn et al. Data-efficient and safe learning for humanoid locomotion aided by a dynamic balancing model
CN106094817B (en) Intensified learning humanoid robot gait's planing method based on big data mode
CN114467097A (en) Method for learning parameters of a neural network, for generating trajectories of an exoskeleton and for setting the exoskeleton in motion
Wang et al. Terrain adaptive walking of biped neuromuscular virtual human using deep reinforcement learning
Palmer et al. Intelligent control of high-speed turning in a quadruped
Flad et al. Experimental validation of a driver steering model based on switching of driver specific primitives
Atmeh et al. A neuro-dynamic walking engine for humanoid robots
Kimpara et al. Human model-based active driving system in vehicular dynamic simulation
Chignoli Trajectory optimization for dynamic aerial motions of legged robots
Mehrabi Dynamics and model-based control of electric power steering systems
Matsubara et al. Spatiotemporal synchronization of biped walking patterns with multiple external inputs by style–phase adaptation
Li et al. Cafe-mpc: A cascaded-fidelity model predictive control framework with tuning-free whole-body control
JP5232120B2 (en) Control device for moving body

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20180720