CN104932264A - Humanoid robot stable control method of RBF-Q learning frame - Google Patents

Humanoid robot stable control method of RBF-Q learning frame

Info

Publication number
CN104932264A
CN104932264A (application CN201510299823.3A)
Authority
CN
China
Prior art keywords
rbf
pitch
learning
ankle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510299823.3A
Other languages
Chinese (zh)
Other versions
CN104932264B (en)
Inventor
毕盛
黄铨雍
韦如明
闵华清
董敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201510299823.3A priority Critical patent/CN104932264B/en
Publication of CN104932264A publication Critical patent/CN104932264A/en
Application granted granted Critical
Publication of CN104932264B publication Critical patent/CN104932264B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Feedback Control In General (AREA)
  • Manipulator (AREA)

Abstract

The invention discloses a humanoid robot stability control method based on an RBF-Q learning framework. The method comprises the following steps: an RBF-Q learning framework is proposed that addresses the problems of continuous state spaces and continuous action spaces in Q-learning; an online action-adjustment stability control algorithm based on RBF-Q learning is proposed, which generates trajectories for the hip, knee and ankle joints of the support leg and, by computing the angles of the remaining joints, controls the humanoid robot to walk stably; finally, the feasibility and validity of the RBF-Q learning framework method are verified on the Vitruvian Man humanoid robot platform designed by the laboratory. The method can generate a stable walking gait for the humanoid robot through online learning.

Description

Humanoid robot stability control method based on a Q-learning framework with an RBF network
Technical field
The present invention relates to the field of humanoid robot walking stability control, and specifically to a humanoid robot stability control method based on a Q-learning framework with an RBF network (RBF-Q Learning).
Background technology
Research on biped walking control for humanoid robot platforms is, in essence, the study of a complex control problem. Complex problems are usually solved by modelling the whole system and solving the system equations. In practice, however, we often encounter problems that are difficult to describe with an accurate model, or whose governing parameters are so numerous and complex that solving the system equations is infeasible. In such cases the problem can be tackled by learning rather than by building a detailed model.
The control of humanoid biped walking is highly unstable and strongly nonlinear, which makes it difficult to obtain a satisfactory solution through accurate modelling. Reinforcement learning and neural network methods have proven effective for complex control problems. These methods do not require the system designer to have a deep and precise understanding of the system dynamics. Through learning, they may provide solutions beyond the designer's own knowledge. Moreover, such methods can learn and improve continuously, much as animals in nature acquire most of their abilities through learning and adaptation.
Summary of the invention
The present invention takes walking stability of a humanoid robot on complex terrain as its research goal. To address the difficulty that reinforcement-learning Q-learning has with continuous state spaces and continuous action spaces, a Q-learning framework based on an RBF network (RBF-Q Learning) is proposed, and this framework is used to design and implement a humanoid robot walking stability control method. The validity of the method is finally verified in simulation and on a physical robot.
The invention provides a humanoid robot stability control method based on a Q-learning framework with an RBF network, which enables the humanoid robot to learn online and produce a stable gait plan, thereby achieving stable walking. The method comprises the following steps:
(1) Design of the Q-learning framework based on an RBF network (RBF-Q Learning).
The present invention designs a Q-learning framework for continuous spaces based on an RBF network. The framework uses an RBF network, which has strong global approximation capability, to fit the Q-function, and uses gradient descent to find the maximum value and the optimal action in each iteration. The algorithm can adjust and learn the RBF network structure and parameters online in real time according to the complexity of the problem, and has good generalization ability.
Combining the RBF network with Q-learning, the present invention designs the RBF-Q Learning algorithm framework, in which the RBF network approximates the Q-function used in Q-learning. Assume that the Q-function receives a state vector s(t) and an action vector a(t) as input and outputs a scalar Q(t).
1) RBF neural network design
Input layer: s(t) denotes the state input to the Q-function at time t; a(t) denotes the action input to the Q-function at time t.
Hidden layer: y_i(t) is the hidden-layer RBF activation function; a Gaussian kernel is used as the RBF activation function of each neuron. The output of the i-th neuron's RBF activation function is computed as:
y_i(t) = exp( −‖x(t) − μ_i(t)‖² / σ_i²(t) ),  i = 1, 2, …, k
where x is the input variable, μ_i and σ_i are the centre and standard deviation of the i-th neuron, and k is the number of RBF activation functions.
Output layer: Q(t) denotes the Q-function output, updated by the following formula:
Q(t) = Σ_{i=1}^{k} w_i(t)·y_i(t)
where w_i is the weight of the i-th neuron's output in the Q-function.
2) RBF network update
Define the Q-learning error δ_Q as:
δ_Q = (1 − λ)(r + γ·Q_max − Q(s, a*, t))
where λ is the learning factor (0 ≤ λ ≤ 1); γ is the discount factor (0 < γ < 1); Q_max is the current maximum Q value in the iterative process; r is the immediate reward; a* denotes the selected optimal action; s is the input state. The error δ_Q indicates the degree of convergence of the Q-function during learning. The learning performance index E of the system is defined as:
E(t) = (1/2)·δ_Q²(t)
The RBF network is updated with the BP algorithm and gradient descent. For each neuron's output weight w_i, the update formula is:
w_i(t+1) = w_i(t) + α_w·(∂E/∂w_i)
where α_w is the learning rate. For E(t) and w_i(t):
∂E/∂δ_Q = ∂((1/2)·δ_Q²)/∂δ_Q = δ_Q
∂δ_Q/∂w_i = y_i
By the chain rule, the update formula for each neuron's output weight w_i becomes:
w_i(t+1) = w_i(t) + α_w·δ_Q(t)·y_i(t)
For the centre μ_i and standard deviation σ_i of each neuron's RBF function, the update formulas are:
μ_i(t+1) = μ_i(t) + α_μ·δ_Q(t)·w_i(t)·y_i(t)·(x(t) − μ_i(t)) / σ_i²(t)
σ_i(t+1) = σ_i(t) + α_σ·δ_Q(t)·w_i(t)·y_i(t)·‖x(t) − μ_i(t)‖² / σ_i³(t)
where α_μ and α_σ are the learning rates of the RBF centres and standard deviations, respectively.
3) Finding the next action in Q-learning by gradient descent
In discrete Q-learning, max{Q(s(t), b, t) | b ∈ A} is found by traversing the Q table, where b denotes the next optimal action. For a Q-function over continuous actions, gradient descent is used to find the next action.
The maximization max{Q(s(t), b, t) | b ∈ A} can be converted into the minimization min{−Q(s(t), b, t) | b ∈ A}. Assume the current state is s(t); for the function −Q(s(t), b, t), the gradient direction is:
∇Q(a) = [ −∂Q(s(t), a, t)/∂a_1, …, −∂Q(s(t), a, t)/∂a_m ]^T
In each iteration, a is updated against this gradient:
a(i+1) = a(i) + λ_a·∇Q[a(i)]
where λ_a is the step size. The overall gradient-descent procedure for solving max{Q(s(t), b, t) | b ∈ A} is:
1. Initialize the parameters: allowable error ΔE_min, maximum number of iterations k, step size λ_a and a randomly assigned initial value a(0); set i = 0.
2. For a(i), use ∇Q(a) = [ −∂Q(s(t), a, t)/∂a_1, …, −∂Q(s(t), a, t)/∂a_m ]^T to compute the current gradient direction ∇Q[a(i)].
3. Use a(i+1) = a(i) + λ_a·∇Q[a(i)] to obtain a(i+1).
4. Compute the error ΔE = ‖a(i+1) − a(i)‖. If ΔE ≤ ΔE_min or i > k, stop; otherwise set i = i+1 and go to step 2.
(2) Design of the online action-adjustment stability controller based on the RBF-Q Learning framework
Two stability controllers are designed, one for the forward-backward (pitch) direction of the robot and one for the left-right (roll) direction:
1) Stability control in the forward-backward direction
For the left-foot support phase (the right-foot support phase is handled analogously), the state input of RBF-Q Learning for forward-backward stability control of the humanoid robot is defined as:
s_pitch(t) = [θ_hip_pitch(t), θ_knee_pitch(t), θ_ankle_pitch(t), θ_xz(t)]
where θ_hip_pitch(t), θ_knee_pitch(t) and θ_ankle_pitch(t) are the left-leg hip pitch servo angle, knee servo angle and ankle servo angle in the offline basic gait of the humanoid robot at time t, and θ_xz(t) is the angle between the torso and the vertical in the xz plane at time t.
Since stability in the forward-backward direction depends mainly on the left-leg hip pitch servo, knee servo and ankle servo, the output action is defined as their online adjustment values:
a_pitch(t) = [Δθ_hip_pitch(t), Δθ_knee_pitch(t), Δθ_ankle_pitch(t)]
where Δθ_hip_pitch(t), Δθ_knee_pitch(t) and Δθ_ankle_pitch(t) are the adjustment angles of the current hip pitch servo, knee servo and ankle servo, respectively.
To evaluate the action taken by the robot, the body deflection angle obtained from the attitude sensor is used to compute an immediate reward.
The immediate reward function of the forward-backward reinforcement-learning stability controller is defined as:
r_pitch(t) = [a_1, a_2]·[r_1(t), r_2(t)]^T
where a_1 and a_2 are the immediate-reward weights, and
r_1(t) = 0 if |θ_xz(t)| ≤ ε, otherwise −|θ_xz(t)|
r_2(t) = 0 if |Δθ_xz(t)| ≤ |Δθ_xz(t−1)|, otherwise −1
where ε is the allowable error band, and θ_xz(t) and Δθ_xz(t) are the torso-vertical angle in the xz plane at time t and its angular velocity, respectively. The immediate reward is intended to keep θ_xz(t) within the allowable error band while making its rate of change Δθ_xz(t) as small as possible.
2) Stability control in the left-right direction
Similarly, for left-right stability control of the humanoid robot, the state input of RBF-Q Learning is defined as:
s_roll(t) = [θ_hip_roll(t), θ_ankle_roll(t), θ_yz(t)]
where θ_hip_roll(t) and θ_ankle_roll(t) are the angles of the left-leg hip roll servo and ankle roll servo in the offline basic gait of the humanoid robot at time t, and θ_yz(t) is the angle between the torso and the vertical in the yz plane at time t.
Since stability in the left-right direction depends mainly on the left-leg hip roll servo and ankle roll servo, the output action is defined as:
a_roll(t) = [Δθ_hip_roll(t), Δθ_ankle_roll(t)]
where Δθ_hip_roll(t) and Δθ_ankle_roll(t) are the adjustment angles of the hip roll servo and ankle roll servo, respectively.
Using the torso-vertical angle in the yz plane and its angular velocity to evaluate stability in the left-right direction, the immediate reward function of the left-right reinforcement-learning stability controller is defined as:
r_roll(t) = [a_1, a_2]·[r_1(t), r_2(t)]^T
where a_1 and a_2 are the immediate-reward weights, and
r_1(t) = 0 if |θ_yz(t)| ≤ ε, otherwise −|θ_yz(t)|
r_2(t) = 0 if |Δθ_yz(t)| ≤ |Δθ_yz(t−1)|, otherwise −1
where ε is the allowable error band, and θ_yz(t) and Δθ_yz(t) are the torso-vertical angle in the yz plane at time t and its angular velocity, respectively. The immediate reward is intended to keep θ_yz(t) within the allowable error band while making its rate of change Δθ_yz(t) as small as possible.
Compared with the prior art, the present invention has the following advantages:
(1) The Q-learning framework based on an RBF network (RBF-Q Learning) optimizes the stability of the robot's walking and provides online learning ability. After a period of learning, the humanoid robot can walk stably across complex terrain.
(2) The control of humanoid biped walking is highly unstable and strongly nonlinear and is therefore difficult to model accurately. The RBF-Q Learning method does not require the system designer to have a deep and precise understanding of the system dynamics. Through learning, the method of the invention may provide solutions beyond the designer's own knowledge. At the same time, the method can learn and improve continuously, much as animals in nature acquire most of their abilities through learning and adaptation.
Description of the drawings
Fig. 1 shows the RBF-Q Learning network structure.
Fig. 2 is a flow diagram of the RBF-Q Learning algorithm framework.
Fig. 3 shows the angular-velocity curves of the robot walking uphill using online action-adjustment stability control (after 1000 walking steps; the upper curve is the robot's angular velocity about the y axis (pitch), and the lower curve is the humanoid robot's angular velocity about the x axis (roll)).
Fig. 4 shows the angular-velocity curves of the robot walking on rugged terrain using online action-adjustment stability control (after 1000 walking steps; the upper curve is the robot's angular velocity about the y axis (pitch), and the lower curve is the humanoid robot's angular velocity about the x axis (roll)).
Detailed description of the embodiments
The specific embodiments of the present invention are described in detail below with reference to the accompanying drawings, but the implementation and protection of the present invention are not limited thereto. Any symbols or procedures not described in detail below can be implemented by those skilled in the art with reference to the prior art.
(1) ZMP analysis is performed on a simplified humanoid robot model using a three-dimensional inverted pendulum model, and the centre-of-mass and foothold trajectories of the robot during the gait are computed. From the centre-of-mass and foothold trajectories, inverse kinematics analysis yields the motion trajectory of each joint of the humanoid robot during the gait, which is stored as the robot's offline basic gait.
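As a minimal Python illustration of the kind of inverted-pendulum relation used in such a ZMP analysis, the sketch below integrates the one-dimensional linear inverted pendulum equation of motion to obtain a centre-of-mass trajectory for a fixed ZMP reference; the function name, parameter values and constant-height assumption are illustrative only and are not prescribed by the patent.

import numpy as np

def lipm_com_trajectory(zmp_x, com_height=0.3, g=9.81, dt=0.01, steps=100, x0=0.0, v0=0.0):
    # Integrate the 1-D linear inverted pendulum dynamics x_ddot = (g / z_c) * (x - p_zmp)
    # for a constant ZMP reference p_zmp = zmp_x and constant CoM height z_c = com_height.
    omega2 = g / com_height          # squared natural frequency of the pendulum
    x, v = x0, v0                    # CoM position and velocity
    trajectory = []
    for _ in range(steps):
        a = omega2 * (x - zmp_x)     # LIPM equation of motion
        v += a * dt
        x += v * dt
        trajectory.append(x)
    return np.array(trajectory)

# Example: CoM evolution for a ZMP reference held 0.05 m ahead of the initial CoM position
com_x = lipm_com_trajectory(zmp_x=0.05)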
(2) Design of the Q-learning framework based on an RBF network (RBF-Q Learning).
1) Fitting the Q-function with an RBF network
An RBF network is used to approximate the Q-function in Q-learning. Assume that the Q-function receives a state vector s(t) and an action vector a(t) as input and outputs a scalar Q(t). The RBF neural network is structured as follows (see Fig. 1).
Input layer: s(t) denotes the state input to the Q-function at time t, of dimension n; a(t) denotes the action input to the Q-function at time t, of dimension m.
Hidden layer: y(t) are the hidden-layer RBF activation functions, k in total. A Gaussian kernel is used as the RBF activation function of each neuron. The output of the i-th neuron's RBF activation function is computed as:
y_i(t) = exp( −‖x(t) − μ_i(t)‖² / σ_i²(t) ),  i = 1, 2, …, k
where x is the input variable, μ_i and σ_i are the centre and standard deviation of the i-th neuron, and k is the number of RBF activation functions.
Output layer: Q(t) denotes the Q-function output, updated by the following formula:
Q(t) = Σ_{i=1}^{k} w_i(t)·y_i(t)
where w_i is the weight of the i-th neuron's output in the Q-function.
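The following Python sketch illustrates this network structure under the assumptions above (Gaussian hidden units over the concatenated state-action input and a linear output layer); the class name RBFQNetwork, the random initialisation of the centres and the default number of neurons are illustrative choices, not taken from the patent.

import numpy as np

class RBFQNetwork:
    def __init__(self, n_state, n_action, k=10, seed=0):
        rng = np.random.default_rng(seed)
        self.k = k                                          # number of RBF neurons
        self.mu = rng.normal(size=(k, n_state + n_action))  # centres mu_i
        self.sigma = np.ones(k)                             # standard deviations sigma_i
        self.w = np.zeros(k)                                # output weights w_i

    def activations(self, s, a):
        # Hidden layer: y_i(t) = exp(-||x(t) - mu_i(t)||^2 / sigma_i^2(t)), with x = [s, a]
        x = np.concatenate([s, a])
        d2 = np.sum((x - self.mu) ** 2, axis=1)
        return np.exp(-d2 / self.sigma ** 2)

    def q_value(self, s, a):
        # Output layer: Q(t) = sum_i w_i(t) * y_i(t)
        return float(self.w @ self.activations(s, a))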
For the update of this RBF network, define the Q-learning error δ_Q as:
δ_Q = (1 − λ)(r + γ·Q_max − Q(s, a*, t))
where λ is the learning factor (0 ≤ λ ≤ 1); γ is the discount factor (0 < γ < 1); Q_max is the current maximum Q value in the iterative process; r is the immediate reward; a* denotes the selected optimal action; s is the input state. The error δ_Q indicates the degree of convergence of the Q-function during learning. The learning performance index E of the system is defined as:
E(t) = (1/2)·δ_Q²(t)
The RBF network is updated with the BP algorithm and gradient descent. For each neuron's output weight w_i, the update formula is:
w_i(t+1) = w_i(t) + α_w·(∂E/∂w_i)
where α_w is the learning rate. For E(t) and w_i(t):
∂E/∂δ_Q = ∂((1/2)·δ_Q²)/∂δ_Q = δ_Q
∂δ_Q/∂w_i = y_i
By the chain rule, the update formula for each neuron's output weight w_i becomes:
w_i(t+1) = w_i(t) + α_w·δ_Q(t)·y_i(t)
For the centre μ_i and standard deviation σ_i of each neuron's RBF function, the update formulas are:
μ_i(t+1) = μ_i(t) + α_μ·δ_Q(t)·w_i(t)·y_i(t)·(x(t) − μ_i(t)) / σ_i²(t)
σ_i(t+1) = σ_i(t) + α_σ·δ_Q(t)·w_i(t)·y_i(t)·‖x(t) − μ_i(t)‖² / σ_i³(t)
where α_μ and α_σ are the learning rates of the RBF centres and standard deviations, respectively.
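A minimal sketch of these update formulas follows, assuming the RBFQNetwork class sketched above; the function name, the default learning rates and the default values of λ and γ are illustrative, and Q_max is simply supplied by the caller.

def rbf_q_update(net, s, a_star, r, q_max, lam=0.5, gamma=0.9,
                 alpha_w=0.05, alpha_mu=0.01, alpha_sigma=0.01):
    # delta_Q = (1 - lambda) * (r + gamma * Q_max - Q(s, a*, t))
    y = net.activations(s, a_star)
    x = np.concatenate([s, a_star])
    delta_q = (1 - lam) * (r + gamma * q_max - net.w @ y)
    diff = x - net.mu                        # x(t) - mu_i(t), one row per neuron
    d2 = np.sum(diff ** 2, axis=1)           # ||x(t) - mu_i(t)||^2
    coeff = delta_q * net.w * y              # delta_Q(t) * w_i(t) * y_i(t), all time-t values
    net.w += alpha_w * delta_q * y                                   # w_i update
    net.mu += alpha_mu * (coeff / net.sigma ** 2)[:, None] * diff    # mu_i update
    net.sigma += alpha_sigma * coeff * d2 / net.sigma ** 3           # sigma_i update
    return delta_q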
2) Finding the next action in Q-learning by gradient descent
In discrete Q-learning, max{Q(s(t), b, t) | b ∈ A} is found by traversing the Q table, where b denotes the next optimal action. For a Q-function over continuous actions, gradient descent is used to find the next action.
The maximization max{Q(s(t), b, t) | b ∈ A} can be converted into the minimization min{−Q(s(t), b, t) | b ∈ A}. Assume the current state is s(t); for the function −Q(s(t), b, t), the gradient direction is:
∇Q(a) = [ −∂Q(s(t), a, t)/∂a_1, …, −∂Q(s(t), a, t)/∂a_m ]^T
In each iteration, a is updated against this gradient:
a(i+1) = a(i) + λ_a·∇Q[a(i)]
where λ_a is the step size. The overall gradient-descent procedure for solving max{Q(s(t), b, t) | b ∈ A} is as follows (a sketch is given after the steps):
1. Initialize the parameters: allowable error ΔE_min, maximum number of iterations k, step size λ_a and a randomly assigned initial value a(0); set i = 0.
2. For a(i), use ∇Q(a) = [ −∂Q(s(t), a, t)/∂a_1, …, −∂Q(s(t), a, t)/∂a_m ]^T to compute the current gradient direction ∇Q[a(i)].
3. Use a(i+1) = a(i) + λ_a·∇Q[a(i)] to obtain a(i+1).
4. Compute the error ΔE = ‖a(i+1) − a(i)‖. If ΔE ≤ ΔE_min or i > k, stop; otherwise set i = i+1 and go to step 2.
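The sketch below implements this search, assuming the RBFQNetwork sketch given earlier; the gradient of Q with respect to the action is approximated by central finite differences, and the function name, step size and tolerances are illustrative values rather than values fixed by the patent.

def best_action(net, s, a0, step=0.01, max_iter=50, tol=1e-4, eps=1e-5):
    # Gradient-based search for the action maximizing Q(s, a), i.e. minimizing -Q(s, a).
    a = np.asarray(a0, dtype=float)
    for _ in range(max_iter):
        grad = np.zeros_like(a)
        for j in range(a.size):                  # finite-difference estimate of dQ/da_j
            da = np.zeros_like(a)
            da[j] = eps
            grad[j] = (net.q_value(s, a + da) - net.q_value(s, a - da)) / (2 * eps)
        a_next = a + step * grad                 # ascend Q (descend -Q)
        if np.linalg.norm(a_next - a) <= tol:    # stop when Delta_E <= Delta_E_min
            return a_next
        a = a_next
    return a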
Combining the RBF neural network and the gradient descent method gives the complete RBF-Q Learning algorithm framework; the algorithm flow is shown in Fig. 2.
(3) Design of the online action-adjustment stability controller based on the RBF-Q Learning framework
The state inputs and action outputs of the RBF-Q Learning framework for humanoid walking are designed as follows. Humanoid biped walking is a process in which two walking phases alternate (taking the case where the first step is taken with the right foot as an example): the left-foot support phase transforms into the right-foot support phase and the cycle repeats, usually with a short double-support phase interspersed between the two. In the left-foot support phase, the three-dimensional inverted pendulum formed with the left foot as support is stabilized mainly by the left-foot servos; at this time the robot's stability in the forward-backward direction is determined by the left-leg hip pitch servo, knee servo and ankle servo, and its stability in the left-right direction is determined by the left-leg hip roll servo and ankle roll servo. Likewise, in the right-foot support phase, forward-backward stability is determined by the right-leg hip pitch servo, knee servo and ankle servo, and left-right stability by the right-leg hip roll servo and ankle roll servo. Based on this structural feature, two stability controllers are designed, one for the forward-backward direction and one for the left-right direction:
1) Stability control in the forward-backward direction
For the left-foot support phase (the right-foot support phase is handled analogously), the state input of RBF-Q Learning for forward-backward stability control of the humanoid robot is defined as:
s_pitch(t) = [θ_hip_pitch(t), θ_knee_pitch(t), θ_ankle_pitch(t), θ_xz(t)]
where θ_hip_pitch(t), θ_knee_pitch(t) and θ_ankle_pitch(t) are the left-leg hip pitch servo angle, knee servo angle and ankle servo angle in the offline basic gait of the humanoid robot at time t, and θ_xz(t) is the angle between the torso and the vertical in the xz plane at time t.
Since stability in the forward-backward direction depends mainly on the left-leg hip pitch servo, knee servo and ankle servo, the output action is defined as their online adjustment values:
a_pitch(t) = [Δθ_hip_pitch(t), Δθ_knee_pitch(t), Δθ_ankle_pitch(t)]
where Δθ_hip_pitch(t), Δθ_knee_pitch(t) and Δθ_ankle_pitch(t) are the adjustment angles of the current hip pitch servo, knee servo and ankle servo, respectively.
To evaluate the action taken by the robot, the body deflection angle obtained from the attitude sensor is used to compute an immediate reward. The immediate reward function of the forward-backward reinforcement-learning stability controller is defined as:
r_pitch(t) = [a_1, a_2]·[r_1(t), r_2(t)]^T
where a_1 and a_2 are the immediate-reward weights, and
r_1(t) = 0 if |θ_xz(t)| ≤ ε, otherwise −|θ_xz(t)|
r_2(t) = 0 if |Δθ_xz(t)| ≤ |Δθ_xz(t−1)|, otherwise −1
where ε is the allowable error band, and θ_xz(t) and Δθ_xz(t) are the torso-vertical angle in the xz plane at time t and its angular velocity, respectively. The immediate reward is intended to keep θ_xz(t) within the allowable error band while making its rate of change Δθ_xz(t) as small as possible.
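A minimal sketch of this immediate reward is given below; the weights a1, a2 and the tolerance band eps take illustrative values (the patent does not fix them), and the same form applies to the left-right controller with θ_yz in place of θ_xz.

def pitch_reward(theta_xz, d_theta_xz, d_theta_xz_prev, a1=1.0, a2=1.0, eps=0.02):
    # r1 penalizes the torso-vertical angle once it leaves the allowable error band
    r1 = 0.0 if abs(theta_xz) <= eps else -abs(theta_xz)
    # r2 penalizes an increasing rate of change of the torso-vertical angle
    r2 = 0.0 if abs(d_theta_xz) <= abs(d_theta_xz_prev) else -1.0
    return a1 * r1 + a2 * r2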
2) Stability control in the left-right direction
Similarly, for left-right stability control of the humanoid robot, the state input of RBF-Q Learning is defined as:
s_roll(t) = [θ_hip_roll(t), θ_ankle_roll(t), θ_yz(t)]
where θ_hip_roll(t) and θ_ankle_roll(t) are the angles of the left-leg hip roll servo and ankle roll servo in the offline basic gait of the humanoid robot at time t, and θ_yz(t) is the angle between the torso and the vertical in the yz plane at time t.
Since stability in the left-right direction depends mainly on the left-leg hip roll servo and ankle roll servo, the output action is defined as:
a_roll(t) = [Δθ_hip_roll(t), Δθ_ankle_roll(t)]
where Δθ_hip_roll(t) and Δθ_ankle_roll(t) are the adjustment angles of the hip roll servo and ankle roll servo, respectively.
Using the torso-vertical angle in the yz plane and its angular velocity to evaluate stability in the left-right direction, the immediate reward function of the left-right reinforcement-learning stability controller is defined as:
r_roll(t) = [a_1, a_2]·[r_1(t), r_2(t)]^T
where a_1 and a_2 are the immediate-reward weights, and
r_1(t) = 0 if |θ_yz(t)| ≤ ε, otherwise −|θ_yz(t)|
r_2(t) = 0 if |Δθ_yz(t)| ≤ |Δθ_yz(t−1)|, otherwise −1
where ε is the allowable error band, and θ_yz(t) and Δθ_yz(t) are the torso-vertical angle in the yz plane at time t and its angular velocity, respectively. The immediate reward is intended to keep θ_yz(t) within the allowable error band while making its rate of change Δθ_yz(t) as small as possible.
3) Online action-adjustment stability control flow based on the RBF-Q Learning framework
For every action about to be executed during the humanoid robot's gait, the stability controller obtains sensor information from the Kalman filtering algorithm and computes the current state from the current offline basic gait. Following the flow of Fig. 2, it updates the RBF-Q Learning framework, obtains the next action, and corrects the action about to be executed in real time.
In summary, each RBF-Q Learning online action-adjustment stability controller runs the following algorithm steps (a sketch of this loop is given after the steps):
1. Initialize the RBF-Q Learning framework.
2. For each walking action about to be executed, obtain the torso-vertical angle and its angular velocity from the Kalman filter fusion algorithm, and compute the current state according to the formulas above.
3. Using the current state, compute the optimal action according to the RBF-Q Learning framework.
4. Use the optimal action obtained in step 3 to correct the next walking action.
5. Execute the next action, obtain the system's immediate reward, and update the RBF-Q Learning framework. Go to step 2.
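The sketch below ties these steps together, assuming the RBFQNetwork, rbf_q_update and best_action sketches given earlier. Because the patent does not specify the robot interface, the sensor reading, servo command and reward computation are passed in as caller-supplied callbacks (read_state, apply_action, reward_fn), and using the Q value of the chosen action as Q_max is a simplification of the framework's bookkeeping.

def online_stability_loop(net, offline_gait, read_state, apply_action, reward_fn, n_adjust=3):
    a = np.zeros(n_adjust)                       # step 1: start from zero joint adjustments
    for base_action in offline_gait:
        s = read_state(base_action)              # step 2: current state from Kalman-filtered sensors
        a = best_action(net, s, a)               # step 3: optimal adjustment from RBF-Q Learning
        apply_action(base_action, a)             # step 4: correct and execute the walking action
        r = reward_fn()                          # step 5: immediate reward after execution
        q_max = net.q_value(s, a)                # simplification: current best Q estimate
        rbf_q_update(net, s, a, r, q_max)        # update the RBF-Q Learning framework
    return net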
(4) Experimental tests and analysis of results
1) Simulation tests and analysis of results
The online action-adjustment stability controller based on the RBF-Q Learning framework was used for online stability control of humanoid walking. The humanoid robot learned in a simulation environment, continuously adapting and modifying the basic gait until it completed the continuous-walking objective.
In this set of experiments the algorithm converged after 1000 walking steps and completed 10 consecutive steps on uphill terrain and on rugged terrain, respectively. The experimental results show that, after a period of learning, a humanoid robot using online action-adjustment stability control based on the RBF-Q Learning framework is able to walk across complex terrain such as slopes and rugged ground.
Fig. 3 shows the real-time change of angular velocity while the humanoid robot, using online action-adjustment stability control based on the RBF-Q Learning framework, walks uphill. The data were recorded during the 1000th walk of the learning process, in which the robot successfully walked 10 consecutive steps on uphill terrain.
Fig. 4 shows the real-time change of angular velocity while the humanoid robot, using online action-adjustment stability control based on the RBF-Q Learning framework, walks on rugged terrain. The data were recorded during the 1000th walk of the learning process, in which the robot successfully walked 10 consecutive steps on rugged terrain.
2) Physical robot tests
In real-world experiments, the online action-adjustment stability control based on the RBF-Q Learning framework was successfully applied to the humanoid robot platform and enabled it to complete the walking task, thereby verifying the validity of the humanoid robot stability control method based on the RBF-Q Learning framework proposed by the present invention.

Claims (1)

1. A humanoid robot stability control method based on a Q-learning framework with an RBF network, characterized by comprising the following steps:
(1) designing the Q-learning framework based on an RBF network (RBF-Q Learning), assuming that the Q-function receives a state vector s(t) and an action vector a(t) as input and outputs a scalar Q(t), specifically comprising:
1) RBF neural network design
Input layer: s(t) denotes the state input to the Q-function at time t; a(t) denotes the action input to the Q-function at time t;
Hidden layer: y_i(t) is the hidden-layer RBF activation function; a Gaussian kernel is used as the RBF activation function of each neuron; the output of the i-th neuron's RBF activation function is computed as:
y_i(t) = exp( −‖x(t) − μ_i(t)‖² / σ_i²(t) ),  i = 1, 2, …, k
where x is the input variable, μ_i and σ_i are the centre and standard deviation of the i-th neuron, and k is the number of RBF activation functions;
Output layer: Q(t) denotes the Q-function output, updated by the following formula:
Q(t) = Σ_{i=1}^{k} w_i(t)·y_i(t)
where w_i is the weight of the i-th neuron's output in the Q-function;
2) RBF network update
Define the Q-learning error δ_Q as:
δ_Q = (1 − λ)(r + γ·Q_max − Q(s, a*, t))
where λ is the learning factor, 0 ≤ λ ≤ 1; γ is the discount factor, 0 < γ < 1; Q_max is the current maximum Q value in the iterative process; r is the immediate reward; a* denotes the selected optimal action; s is the input state; the error δ_Q indicates the degree of convergence of the Q-function during learning; the learning performance index E of the system is defined as:
E(t) = (1/2)·δ_Q²(t)
The RBF network is updated with the BP algorithm and gradient descent; for each neuron's output weight w_i, the update formula is:
w_i(t+1) = w_i(t) + α_w·(∂E/∂w_i)
where α_w is the learning rate; for E(t) and w_i(t):
∂E/∂δ_Q = ∂((1/2)·δ_Q²)/∂δ_Q = δ_Q
∂δ_Q/∂w_i = y_i
By the chain rule, the update formula for each neuron's output weight w_i becomes:
w_i(t+1) = w_i(t) + α_w·δ_Q(t)·y_i(t)
For the centre μ_i and standard deviation σ_i of each neuron's RBF function, the update formulas are:
μ_i(t+1) = μ_i(t) + α_μ·δ_Q(t)·w_i(t)·y_i(t)·(x(t) − μ_i(t)) / σ_i²(t)
σ_i(t+1) = σ_i(t) + α_σ·δ_Q(t)·w_i(t)·y_i(t)·‖x(t) − μ_i(t)‖² / σ_i³(t)
where α_μ and α_σ are the learning rates of the RBF centres and standard deviations, respectively;
3) Finding the next action in Q-learning by gradient descent
In discrete Q-learning, max{Q(s(t), b, t) | b ∈ A} is found by traversing the Q table, where b denotes the next optimal action; for a Q-function over continuous actions, gradient descent is used to find the next action;
The maximization max{Q(s(t), b, t) | b ∈ A} can be converted into the minimization min{−Q(s(t), b, t) | b ∈ A}; assuming the current state is s(t), for the function −Q(s(t), b, t) the gradient direction is:
∇Q(a) = [ −∂Q(s(t), a, t)/∂a_1, …, −∂Q(s(t), a, t)/∂a_m ]^T
In each iteration, a is updated against this gradient:
a(i+1) = a(i) + λ_a·∇Q[a(i)]
where λ_a is the step size; the overall gradient-descent procedure for solving max{Q(s(t), b, t) | b ∈ A} is:
1. Initialize the parameters: allowable error ΔE_min, maximum number of iterations k, step size λ_a and a randomly assigned initial value a(0); set i = 0;
2. For a(i), use ∇Q(a) = [ −∂Q(s(t), a, t)/∂a_1, …, −∂Q(s(t), a, t)/∂a_m ]^T to compute the current gradient direction ∇Q[a(i)];
3. Use a(i+1) = a(i) + λ_a·∇Q[a(i)] to obtain a(i+1);
4. Compute the error ΔE = ‖a(i+1) − a(i)‖; if ΔE ≤ ΔE_min or i > k, stop; otherwise set i = i+1 and go to step 2;
(2) designing the online action-adjustment stability controller based on the RBF-Q Learning framework; for the forward-backward and left-right directions of the robot, two stability controllers are designed respectively:
1) Stability control in the forward-backward direction
For the left-foot support phase (the right-foot support phase is handled analogously), the state input of RBF-Q Learning for forward-backward stability control of the humanoid robot is defined as:
s_pitch(t) = [θ_hip_pitch(t), θ_knee_pitch(t), θ_ankle_pitch(t), θ_xz(t)]
where θ_hip_pitch(t), θ_knee_pitch(t) and θ_ankle_pitch(t) are the left-leg hip pitch servo angle, knee servo angle and ankle servo angle in the offline basic gait of the humanoid robot at time t, and θ_xz(t) is the angle between the torso and the vertical in the xz plane at time t;
Since stability in the forward-backward direction depends mainly on the left-leg hip pitch servo, knee servo and ankle servo, the output action is defined as their online adjustment values:
a_pitch(t) = [Δθ_hip_pitch(t), Δθ_knee_pitch(t), Δθ_ankle_pitch(t)]
where Δθ_hip_pitch(t), Δθ_knee_pitch(t) and Δθ_ankle_pitch(t) are the adjustment angles of the current hip pitch servo, knee servo and ankle servo, respectively;
To evaluate the action taken by the robot, the body deflection angle obtained from the attitude sensor is used to compute an immediate reward;
The immediate reward function of the forward-backward reinforcement-learning stability controller is defined as:
r_pitch(t) = [a_1, a_2]·[r_1(t), r_2(t)]^T
where a_1 and a_2 are the immediate-reward weights, and
r_1(t) = 0 if |θ_xz(t)| ≤ ε, otherwise −|θ_xz(t)|
r_2(t) = 0 if |Δθ_xz(t)| ≤ |Δθ_xz(t−1)|, otherwise −1
where ε is the allowable error band, and θ_xz(t) and Δθ_xz(t) are the torso-vertical angle in the xz plane at time t and its angular velocity, respectively; the immediate reward is intended to keep θ_xz(t) within the allowable error band while making its rate of change Δθ_xz(t) as small as possible;
2) Stability control in the left-right direction
Similarly, for left-right stability control of the humanoid robot, the state input of RBF-Q Learning is defined as:
s_roll(t) = [θ_hip_roll(t), θ_ankle_roll(t), θ_yz(t)]
where θ_hip_roll(t) and θ_ankle_roll(t) are the angles of the left-leg hip roll servo and ankle roll servo in the offline basic gait of the humanoid robot at time t, and θ_yz(t) is the angle between the torso and the vertical in the yz plane at time t;
Since stability in the left-right direction depends mainly on the left-leg hip roll servo and ankle roll servo, the output action is defined as:
a_roll(t) = [Δθ_hip_roll(t), Δθ_ankle_roll(t)]
where Δθ_hip_roll(t) and Δθ_ankle_roll(t) are the adjustment angles of the hip roll servo and ankle roll servo, respectively;
Using the torso-vertical angle in the yz plane and its angular velocity to evaluate stability in the left-right direction, the immediate reward function of the left-right reinforcement-learning stability controller is defined as:
r_roll(t) = [a_1, a_2]·[r_1(t), r_2(t)]^T
where a_1 and a_2 are the immediate-reward weights, and
r_1(t) = 0 if |θ_yz(t)| ≤ ε, otherwise −|θ_yz(t)|
r_2(t) = 0 if |Δθ_yz(t)| ≤ |Δθ_yz(t−1)|, otherwise −1
where ε is the allowable error band, and θ_yz(t) and Δθ_yz(t) are the torso-vertical angle in the yz plane at time t and its angular velocity, respectively; the immediate reward is intended to keep θ_yz(t) within the allowable error band while making its rate of change Δθ_yz(t) as small as possible.
CN201510299823.3A 2015-06-03 2015-06-03 The apery robot stabilized control method of Q learning frameworks based on RBF networks Expired - Fee Related CN104932264B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510299823.3A CN104932264B (en) 2015-06-03 2015-06-03 The apery robot stabilized control method of Q learning frameworks based on RBF networks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510299823.3A CN104932264B (en) 2015-06-03 2015-06-03 The apery robot stabilized control method of Q learning frameworks based on RBF networks

Publications (2)

Publication Number Publication Date
CN104932264A true CN104932264A (en) 2015-09-23
CN104932264B CN104932264B (en) 2018-07-20

Family

ID=54119479

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510299823.3A Expired - Fee Related CN104932264B (en) 2015-06-03 2015-06-03 The apery robot stabilized control method of Q learning frameworks based on RBF networks

Country Status (1)

Country Link
CN (1) CN104932264B (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106094813A (en) * 2016-05-26 2016-11-09 华南理工大学 It is correlated with based on model humanoid robot gait's control method of intensified learning
CN106094817A (en) * 2016-06-14 2016-11-09 华南理工大学 Intensified learning humanoid robot gait's planing method based on big data mode
CN107292392A (en) * 2017-05-11 2017-10-24 苏州大学 Large-range monitoring method and supervisory-controlled robot based on the double Q study of depth cum rights
CN107292344A (en) * 2017-06-26 2017-10-24 苏州大学 Robot real-time control method based on environment interaction
CN107403049A (en) * 2017-07-31 2017-11-28 山东师范大学 A kind of Q Learning pedestrians evacuation emulation method and system based on artificial neural network
CN108051787A (en) * 2017-12-05 2018-05-18 上海无线电设备研究所 A kind of missile-borne radar flying test method
CN108537379A (en) * 2018-04-04 2018-09-14 北京科东电力控制系统有限责任公司 Adaptive variable weight combination load forecasting method and device
CN108631817A (en) * 2018-05-10 2018-10-09 东北大学 A method of Frequency Hopping Signal frequency range prediction is carried out based on time frequency analysis and radial neural network
CN108873687A (en) * 2018-07-11 2018-11-23 哈尔滨工程大学 A kind of Intelligent Underwater Robot behavior system knot planing method based on depth Q study
CN109348707A (en) * 2016-04-27 2019-02-15 纽拉拉股份有限公司 For the method and apparatus of the Q study trimming experience memory based on deep neural network
CN109827292A (en) * 2019-01-16 2019-05-31 珠海格力电器股份有限公司 Construction method, control method, the household electrical appliances of household electrical appliances adaptive power conservation Controlling model
CN110712201A (en) * 2019-09-20 2020-01-21 同济大学 Robot multi-joint self-adaptive compensation method based on perceptron model and stabilizer
WO2020199648A1 (en) * 2019-04-01 2020-10-08 珠海格力电器股份有限公司 Control method and device for air conditioner
CN113062601A (en) * 2021-03-17 2021-07-02 同济大学 Q learning-based concrete distributing robot trajectory planning method
CN113467235A (en) * 2021-06-10 2021-10-01 清华大学 Biped robot gait control method and control device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011065553A (en) * 2009-09-18 2011-03-31 Honda Motor Co Ltd Learning control system and learning control method
JP2011204036A (en) * 2010-03-25 2011-10-13 Institute Of National Colleges Of Technology Japan Experience reinforcement type reinforcement learning system, experience reinforcement type reinforcement learning method and experience reinforcement type reinforcement learning program
CN102402712A (en) * 2011-08-31 2012-04-04 山东大学 Robot reinforced learning initialization method based on neural network
CN103204193A (en) * 2013-04-08 2013-07-17 浙江大学 Under-actuated biped robot walking control method
CN103440366A (en) * 2013-08-05 2013-12-11 广东电网公司电力科学研究院 BP (Back Propagation) neural network-based exhaust dryness computing method of USC (Ultra-Supercritical) turbine
CN103605285A (en) * 2013-11-21 2014-02-26 南京理工大学 Fuzzy nerve network control method for automobile driving robot system
WO2014047142A1 (en) * 2012-09-20 2014-03-27 Brain Corporation Spiking neuron network adaptive control apparatus and methods

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011065553A (en) * 2009-09-18 2011-03-31 Honda Motor Co Ltd Learning control system and learning control method
JP2011204036A (en) * 2010-03-25 2011-10-13 Institute Of National Colleges Of Technology Japan Experience reinforcement type reinforcement learning system, experience reinforcement type reinforcement learning method and experience reinforcement type reinforcement learning program
CN102402712A (en) * 2011-08-31 2012-04-04 山东大学 Robot reinforced learning initialization method based on neural network
WO2014047142A1 (en) * 2012-09-20 2014-03-27 Brain Corporation Spiking neuron network adaptive control apparatus and methods
CN103204193A (en) * 2013-04-08 2013-07-17 浙江大学 Under-actuated biped robot walking control method
CN103440366A (en) * 2013-08-05 2013-12-11 广东电网公司电力科学研究院 BP (Back Propagation) neural network-based exhaust dryness computing method of USC (Ultra-Supercritical) turbine
CN103605285A (en) * 2013-11-21 2014-02-26 南京理工大学 Fuzzy nerve network control method for automobile driving robot system

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
刘道远: "基于Q学习的欠驱动双足机器人行走控制研究", 《中国优秀硕士学位论文全文数据库信息科技辑(月刊)》 *
吴洪岩,等: "基于RBFNN的强化学习在机器人导航中的应用", 《吉林大学学报(信息科学版)》 *
吴洪岩: "基于强化学习的自主移动机器人导航研究", 《中国优秀硕士学位论文全文数据库信息科技辑(月刊)》 *
尹俊明,等: "基于RBF-Q学习的四足机器人运动协调控制", 《计算机应用研究》 *
盛维涛: "基于激励学习算法的移动机器人避障规划研究盛维涛", 《中国优秀硕士学位论文全文数据库信息科技辑(月刊)》 *
葛媛,等: "模糊强化学习在机器人导航中的应用", 《信息技术》 *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109348707A (en) * 2016-04-27 2019-02-15 纽拉拉股份有限公司 For the method and apparatus of the Q study trimming experience memory based on deep neural network
CN106094813A (en) * 2016-05-26 2016-11-09 华南理工大学 It is correlated with based on model humanoid robot gait's control method of intensified learning
CN106094813B (en) * 2016-05-26 2019-01-18 华南理工大学 Humanoid robot gait's control method based on model correlation intensified learning
CN106094817A (en) * 2016-06-14 2016-11-09 华南理工大学 Intensified learning humanoid robot gait's planing method based on big data mode
CN107292392A (en) * 2017-05-11 2017-10-24 苏州大学 Large-range monitoring method and supervisory-controlled robot based on the double Q study of depth cum rights
CN107292392B (en) * 2017-05-11 2019-11-22 苏州大学 Large-range monitoring method and supervisory-controlled robot based on the double Q study of depth cum rights
CN107292344A (en) * 2017-06-26 2017-10-24 苏州大学 Robot real-time control method based on environment interaction
CN107292344B (en) * 2017-06-26 2020-09-18 苏州大学 Robot real-time control method based on environment interaction
CN107403049A (en) * 2017-07-31 2017-11-28 山东师范大学 A kind of Q Learning pedestrians evacuation emulation method and system based on artificial neural network
CN107403049B (en) * 2017-07-31 2019-03-19 山东师范大学 A kind of Q-Learning pedestrian's evacuation emulation method and system based on artificial neural network
CN108051787A (en) * 2017-12-05 2018-05-18 上海无线电设备研究所 A kind of missile-borne radar flying test method
CN108537379A (en) * 2018-04-04 2018-09-14 北京科东电力控制系统有限责任公司 Adaptive variable weight combination load forecasting method and device
CN108631817A (en) * 2018-05-10 2018-10-09 东北大学 A method of Frequency Hopping Signal frequency range prediction is carried out based on time frequency analysis and radial neural network
CN108631817B (en) * 2018-05-10 2020-05-19 东北大学 Method for predicting frequency hopping signal frequency band based on time-frequency analysis and radial neural network
CN108873687A (en) * 2018-07-11 2018-11-23 哈尔滨工程大学 A kind of Intelligent Underwater Robot behavior system knot planing method based on depth Q study
CN109827292A (en) * 2019-01-16 2019-05-31 珠海格力电器股份有限公司 Construction method, control method, the household electrical appliances of household electrical appliances adaptive power conservation Controlling model
WO2020199648A1 (en) * 2019-04-01 2020-10-08 珠海格力电器股份有限公司 Control method and device for air conditioner
US11965666B2 (en) 2019-04-01 2024-04-23 Gree Electric Appliances, Inc. Of Zhuhai Control method for air conditioner, and device for air conditioner and storage medium
CN110712201A (en) * 2019-09-20 2020-01-21 同济大学 Robot multi-joint self-adaptive compensation method based on perceptron model and stabilizer
CN110712201B (en) * 2019-09-20 2022-09-16 同济大学 Robot multi-joint self-adaptive compensation method based on perceptron model and stabilizer
CN113062601A (en) * 2021-03-17 2021-07-02 同济大学 Q learning-based concrete distributing robot trajectory planning method
CN113467235A (en) * 2021-06-10 2021-10-01 清华大学 Biped robot gait control method and control device
CN113467235B (en) * 2021-06-10 2022-09-02 清华大学 Biped robot gait control method and control device

Also Published As

Publication number Publication date
CN104932264B (en) 2018-07-20

Similar Documents

Publication Publication Date Title
CN104932264A (en) Humanoid robot stable control method of RBF-Q learning frame
EP2017042B1 (en) Motion controller and motion control method for legged walking robot, and robot apparatus
US8417382B2 (en) Control device for legged mobile body
Rodriguez et al. DeepWalk: Omnidirectional bipedal gait by deep reinforcement learning
US8306657B2 (en) Control device for legged mobile robot
US8311677B2 (en) Control device for legged mobile robot
KR101083414B1 (en) Controller of legged mobile robot
US20110022232A1 (en) Control device for mobile body
KR20010050543A (en) Ambulation control apparatus and ambulation control method of robot
KR20050021288A (en) Robot and attitude control method of robot
Pandala et al. Robust predictive control for quadrupedal locomotion: Learning to close the gap between reduced-and full-order models
Atmeh et al. Implementation of an adaptive, model free, learning controller on the Atlas robot
Ahn et al. Data-efficient and safe learning for humanoid locomotion aided by a dynamic balancing model
CN106094817B (en) Intensified learning humanoid robot gait's planing method based on big data mode
CN114467097A (en) Method for learning parameters of a neural network, for generating trajectories of an exoskeleton and for setting the exoskeleton in motion
Wang et al. Terrain adaptive walking of biped neuromuscular virtual human using deep reinforcement learning
Palmer et al. Intelligent control of high-speed turning in a quadruped
Flad et al. Experimental validation of a driver steering model based on switching of driver specific primitives
Atmeh et al. A neuro-dynamic walking engine for humanoid robots
Kimpara et al. Human model-based active driving system in vehicular dynamic simulation
Chignoli Trajectory optimization for dynamic aerial motions of legged robots
Mehrabi Dynamics and model-based control of electric power steering systems
Matsubara et al. Spatiotemporal synchronization of biped walking patterns with multiple external inputs by style–phase adaptation
Li et al. Cafe-mpc: A cascaded-fidelity model predictive control framework with tuning-free whole-body control
JP5232120B2 (en) Control device for moving body

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20180720